Urllib.request vs. Requests.get

Julius Hamilton juliushamilton100 at gmail.com
Tue Dec 7 06:35:06 EST 2021


Hey,

I am currently working on a simple program that scrapes text from a webpage
given its URL and then segments it (with spaCy).

I’m trying to refine the program to use just the right tool for each step.
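
Roughly, the pipeline looks something like this (the model name is just a
placeholder, and the HTML-to-plain-text step is simplified away here):

    import requests
    import spacy

    nlp = spacy.load("en_core_web_sm")  # assumes this English model is installed

    url = "https://juno.sh/direct-connection-to-jupyter-server/"
    html = requests.get(url).text

    # The real program extracts plain text from the HTML first; that step
    # is omitted here, so the raw markup is fed to spaCy directly.
    doc = nlp(html)
    sentences = [sent.text for sent in doc.sents]
    print(len(sentences))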

requests.get works great, but I’ve seen people use urllib.request.urlopen()
in some examples. It appealed to me because it seems lower level than
requests.get, which makes the program feel leaner and more direct.

However, requests.get works fine on this URL:

https://juno.sh/direct-connection-to-jupyter-server/

But urllib.request.urlopen() fails with “HTTP Error 403: Forbidden”.
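
Concretely, this is the comparison I mean (simplified):

    import urllib.error
    import urllib.request

    import requests

    url = "https://juno.sh/direct-connection-to-jupyter-server/"

    # requests: succeeds and returns the page source
    resp = requests.get(url)
    print(resp.status_code)        # 200 for me
    print(len(resp.text))

    # urllib: the same URL is rejected
    try:
        with urllib.request.urlopen(url) as f:
            print(len(f.read()))
    except urllib.error.HTTPError as e:
        print(e)                   # HTTP Error 403: Forbidden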

Could anyone please comment on what the fundamental differences are between
urllib and requests, why this happens, and whether urllib has any option to
prevent it and retrieve the page source?
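
In case it helps to show what I mean, here is a sketch of the kind of
workaround I’ve seen suggested, assuming the server is rejecting urllib’s
default “Python-urllib” User-Agent header (I haven’t confirmed that this is
actually the cause):

    import urllib.request

    url = "https://juno.sh/direct-connection-to-jupyter-server/"

    # Send a browser-style User-Agent instead of urllib's default one
    req = urllib.request.Request(
        url,
        headers={"User-Agent": "Mozilla/5.0"},
    )
    with urllib.request.urlopen(req) as f:
        text = f.read().decode("utf-8")
    print(len(text))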

Thanks,
Julius

