Urllib.request vs. Requests.get

Chris Angelico rosuav at gmail.com
Tue Dec 7 13:13:34 EST 2021


On Wed, Dec 8, 2021 at 4:51 AM Julius Hamilton
<juliushamilton100 at gmail.com> wrote:
>
> Hey,
>
> I am currently working on a simple program which scrapes text from webpages
> via a URL, then segments it (with Spacy).
>
> I’m trying to refine my program to use just the right tools for the job,
> for each of the steps.
>
> Requests.get works great, but I’ve seen people use urllib.request.urlopen()
> in some examples. It appealed to me because it seemed lower level than
> requests.get, so it just makes the program feel leaner and purer and more
> direct.
>
> However, requests.get works fine on this url:
>
> https://juno.sh/direct-connection-to-jupyter-server/
>
> But urllib returns a “403 forbidden”.
>
> Could anyone please comment on what the fundamental differences are between
> urllib vs. requests, why this would happen, and if urllib has any option to
> prevent this and get the page source?
>

*Fundamental* differences? Not many. Both speak the same HTTP
underneath; the requests module is designed to be easy to use and
sends a fuller set of default headers, whereas urllib is designed to
be basic and simple. Not really a fundamental difference, but perhaps
indicative.

I'd recommend doing the query with requests, and seeing exactly what
headers are being sent. Most likely, there'll be something that you
need to add explicitly when using urllib that the server is looking
for (a user agent is the usual suspect). Requests does its logging
through Python's standard logging module (via urllib3), so it should
be a simple matter of setting the log level to DEBUG and sending the
request; the headers that were actually sent are also kept on the
response object if you'd rather inspect them directly.
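
Something along these lines should show it all. This is a sketch,
untested, and the exact output will vary with your requests/urllib3
versions:

import logging
import http.client
import requests

# http.client prints the raw request and response lines to stdout once
# debuglevel is set; urllib3 (which requests is built on) logs its
# connection handling through the standard logging module.
http.client.HTTPConnection.debuglevel = 1
logging.basicConfig(level=logging.DEBUG)

resp = requests.get("https://juno.sh/direct-connection-to-jupyter-server/")
print(resp.status_code)

# The prepared request is kept on the response, so the headers that
# were actually sent can be inspected directly as well:
print(resp.request.headers)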

TBH though, I'd just recommend using requests, unless you specifically
need to avoid the dependency :)
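
If you do end up sticking with urllib, the usual fix for a 403 like
this is to supply a browser-ish User-Agent explicitly. A sketch, with
the caveat that it's a guess whether this particular server checks
anything beyond the user agent:

import urllib.request

url = "https://juno.sh/direct-connection-to-jupyter-server/"

# urllib identifies itself as "Python-urllib/3.x" by default, and some
# servers reject that outright. Any other User-Agent string may do.
req = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
with urllib.request.urlopen(req) as resp:
    html = resp.read().decode("utf-8", errors="replace")

print(html[:200])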

ChrisA
