Urllib.request vs. Requests.get
Paul Bryan
pbryan at anode.ca
Tue Dec 7 13:14:31 EST 2021
Cloudflare, for whatever reason, appears to be rejecting the `User-
Agent` header that urllib is providing: `Python-urllib/3.9`. Using a
different `User-Agent` seems to get around the issue:
import urllib.request

req = urllib.request.Request(
    url="https://juno.sh/direct-connection-to-jupyter-server/",
    method="GET",
    headers={"User-Agent": "Workaround/1.0"},
)
res = urllib.request.urlopen(req)
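If you'd rather not build a Request object for every call, another option is to install a global opener whose default `User-Agent` replaces `Python-urllib/3.x`. A minimal sketch (the `"MyScraper/1.0"` string is just an arbitrary example value, not anything special):

```python
import urllib.request

# Build an opener and replace the default headers it sends.
# addheaders normally contains ("User-agent", "Python-urllib/3.x").
opener = urllib.request.build_opener()
opener.addheaders = [("User-Agent", "MyScraper/1.0")]

# After this, every plain urllib.request.urlopen() call uses the
# custom header; explicit Request objects can still override it.
urllib.request.install_opener(opener)
```
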
Paul
On Tue, 2021-12-07 at 12:35 +0100, Julius Hamilton wrote:
> Hey,
>
> I am currently working on a simple program which scrapes text from
> webpages via a URL, then segments it (with Spacy).
>
> I’m trying to refine my program to use just the right tools for the
> job, for each of the steps.
>
> Requests.get works great, but I’ve seen people use
> urllib.request.urlopen() in some examples. It appealed to me because
> it seemed lower level than requests.get, so it just makes the program
> feel leaner and purer and more direct.
>
> However, requests.get works fine on this url:
>
> https://juno.sh/direct-connection-to-jupyter-server/
>
> But urllib returns a “403 Forbidden”.
>
> Could anyone please comment on what the fundamental differences are
> between urllib vs. requests, why this would happen, and if urllib has
> any option to prevent this and get the page source?
>
> Thanks,
> Julius
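As for why requests.get succeeds where urllib fails: requests identifies itself as `python-requests/<version>` by default rather than `Python-urllib/3.x`, which Cloudflare evidently accepts. You can inspect that default, and override it per call the same way as the urllib fix above, without any network round trip:

```python
import requests  # third-party: pip install requests

# The default User-Agent requests will send, e.g. "python-requests/2.26.0".
session = requests.Session()
default_ua = session.headers["User-Agent"]

# Overriding it per request mirrors the urllib workaround:
# session.get(url, headers={"User-Agent": "Workaround/1.0"})
```
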