urllib2 - 403 that _should_ not occur.

James Mills prologic at shortcircuit.net.au
Sun Jan 11 22:50:28 EST 2009


On Mon, Jan 12, 2009 at 1:25 PM, Philip Semanchuk <philip at semanchuk.com> wrote:
> Oooops, I guess it is my brain that's not working, then! Sorry about that.

No problem.

> I tried your sample and got the 403. This works for me:

(...)

> Some sites ban UAs that look like bots. I know there's a Java-based bot with
> a distinct UA that was really badly behaved when visiting my server. It ignored
> robots.txt, fetched pages as quickly as it could, etc. That was worthy of
> banning. FWIW, when I try the code above with a UA of "funny fish" it still
> works OK, so it looks like the groups.google.com server has it out for UAs
> with Python in them, not just unknown ones.
>
> I'm sure that if you changed wget's UA string to something Pythonic it would
> start to fail too.

The problem I'm solving: my use case is a tool
that periodically checks configured RSS feeds
for updates. I was going to use urllib2 to fetch
the data and pass it off to feedparser.parse(...).
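
Something like this, roughly - a minimal sketch with a
placeholder feed URL and an arbitrary User-Agent string;
overriding urllib2's default UA is what gets past the 403:

import urllib2

import feedparser

FEED_URL = "http://example.com/feed.rss"  # placeholder

# urllib2 sends "Python-urllib/x.y" by default, which some servers
# (groups.google.com among them, apparently) reject with a 403.
# Supplying our own User-Agent header works around that.
request = urllib2.Request(FEED_URL,
                          headers={"User-Agent": "FeedChecker/0.1"})
response = urllib2.urlopen(request)
data = response.read()

# feedparser.parse() accepts a string of feed data as well as a
# URL, so we can hand it the raw bytes we just fetched.
d = feedparser.parse(data)
print d.feed.title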

Because of the UA problem though (which can be
overcome, as above, by setting the User-Agent
header), I decided to try a different approach
and use feedparser entirely (which uses urllib
internally).

Problem is, feedparser doesn't store the raw HTTP
response content anywhere - only the parsed
results - *sigh*.

My solution now is to parse each feed, store the
data I require in a simple object, and pickle
that to a set of cache files; to detect updates
I compare hashes of the content against the
cached copies.
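
Roughly like this - a sketch only, with hypothetical names
(FeedData, the cache directory) and the details cut down:

import hashlib
import os
import pickle

import feedparser

CACHE_DIR = "cache"  # hypothetical cache directory


class FeedData(object):
    """Minimal container for the bits of a feed worth keeping."""

    def __init__(self, url, title, entries, digest):
        self.url = url
        self.title = title
        self.entries = entries  # list of (title, link) tuples
        self.digest = digest    # hash of the parsed entries


def check_feed(url):
    """Return new FeedData if the feed changed, else None."""
    d = feedparser.parse(url)
    entries = [(e.get("title", ""), e.get("link", ""))
               for e in d.entries]

    # feedparser doesn't keep the raw response body around, so
    # hash a stable representation of the parsed entries instead.
    digest = hashlib.sha1(repr(entries)).hexdigest()

    cache_file = os.path.join(CACHE_DIR,
                              hashlib.sha1(url).hexdigest())
    if os.path.exists(cache_file):
        with open(cache_file, "rb") as f:
            cached = pickle.load(f)
        if cached.digest == digest:
            return None  # unchanged since the last run

    data = FeedData(url, d.feed.get("title", ""), entries, digest)
    with open(cache_file, "wb") as f:
        pickle.dump(data, f)
    return data

If the servers cooperate, the hashing could be replaced with
conditional GETs: feedparser exposes d.etag and d.modified when
the server sends those headers, and parse() accepts etag= and
modified= arguments so an unchanged feed comes back as a 304.
Not every feed supplies them, though, hence the hash fallback.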

cheers
James


