Urllib.request vs. Requests.get
Paul Bryan
pbryan at anode.ca
Tue Dec 7 13:14:31 EST 2021
Cloudflare, for whatever reason, appears to be rejecting the `User-
Agent` header that urllib is providing: `Python-urllib/3.9`. Using a
different `User-Agent` seems to get around the issue:
import urllib.request

req = urllib.request.Request(
    url="https://juno.sh/direct-connection-to-jupyter-server/",
    method="GET",
    headers={"User-Agent": "Workaround/1.0"},
)
res = urllib.request.urlopen(req)
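If you'd rather not build a Request object for every call, another option is to install a global opener whose default `User-Agent` replaces `Python-urllib/3.x`. A minimal sketch (the `"MyScraper/1.0"` string is just an arbitrary example value, not anything special):

```python
import urllib.request

# Build an opener and replace the default headers it sends.
# addheaders normally contains ("User-agent", "Python-urllib/3.x").
opener = urllib.request.build_opener()
opener.addheaders = [("User-Agent", "MyScraper/1.0")]

# After this, every plain urllib.request.urlopen() call uses the
# custom header; explicit Request objects can still override it.
urllib.request.install_opener(opener)
```
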
Paul
On Tue, 2021-12-07 at 12:35 +0100, Julius Hamilton wrote:
> Hey,
>
> I am currently working on a simple program which scrapes text from
> webpages via a URL, then segments it (with Spacy).
>
> I’m trying to refine my program to use just the right tools for the
> job, for each of the steps.
>
> Requests.get works great, but I’ve seen people use
> urllib.request.urlopen() in some examples. It appealed to me because
> it seemed lower level than requests.get, so it just makes the program
> feel leaner and purer and more direct.
>
> However, requests.get works fine on this url:
>
> https://juno.sh/direct-connection-to-jupyter-server/
>
> But urllib returns a “403 Forbidden”.
>
> Could anyone please comment on what the fundamental differences are
> between urllib vs. requests, why this would happen, and if urllib has
> any option to prevent this and get the page source?
>
> Thanks,
> Julius
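As for why requests.get succeeds where urllib fails: requests identifies itself as `python-requests/<version>` by default rather than `Python-urllib/3.x`, which Cloudflare evidently accepts. You can inspect that default, and override it per call the same way as the urllib fix above, without any network round trip:

```python
import requests  # third-party: pip install requests

# The default User-Agent requests will send, e.g. "python-requests/2.26.0".
session = requests.Session()
default_ua = session.headers["User-Agent"]

# Overriding it per request mirrors the urllib workaround:
# session.get(url, headers={"User-Agent": "Workaround/1.0"})
```
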