urllib, urllib2, httplib -- Begging for consolidation?

Thu May 9 11:20:56 EDT 2002

brueckd at tbye.com wrote in message news:<mailman.1020867014.19652.python-list at python.org>...
> On 8 May 2002, Paul Boddie wrote:
> >
> > I think we should stick with the urlopen concept because it's very
> > powerful - just open a URL and pretend that it's a file.
> 
> Oh, I wasn't saying that it's not powerful (it is), just that it's not too
> commonly useful. Obviously YMMV, but for me it has been pretty rare to
> want to open some generic resource and read it. I'm not arguing that we
> should get rid of this functionality, just observing that for me it has
> never been the common case, and doing what has been the common case (http
> connection + extra headers + some data to post) is always more work than 
> it needs to be.

For me, what I've mostly been doing with urllib is to connect to
locations and to download files. Indeed, having this functionality in
the standard library is incredibly useful when a server, for example,
doesn't have an FTP service running, but can make connections to the
outside and has Python (even 1.5.2) installed. So for me, just being
able to connect to resources regardless of protocol is occasionally a
"killer application" of urllib.

Anyway, returning to the point, there is a large amount of overlap
between HTTP and FTP for such simple operations, and the following
URLs, which can be used to access a remote server (with authentication
details provided) and to download a resource, both make sense:

  http://user@myserver:8080/docs/resource
  https://user@myserver:8080/docs/resource
  ftp://user@myserver/pub/docs/resource

Even the following URLs, which I've just made up, share several common
details with those above:

  pop3://user@myserver/INBOX/123
  imap://user@myserver/INBOX/123

I can't remember the exact form of "pop3" (or "pop") and "imap" URLs,
however.
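
To make the overlap concrete, here is a minimal sketch that splits the
URLs above into the parts they share. It assumes a current Python 3,
where the parsing functions live in urllib.parse; the helper name
common_parts is purely illustrative, and the "pop3" and "imap" forms
remain the made-up ones.

```python
from urllib.parse import urlsplit

# URLs from the message; the "pop3" and "imap" forms are invented examples.
URLS = [
    "http://user@myserver:8080/docs/resource",
    "https://user@myserver:8080/docs/resource",
    "ftp://user@myserver/pub/docs/resource",
    "pop3://user@myserver/INBOX/123",
    "imap://user@myserver/INBOX/123",
]


def common_parts(url):
    """Return the pieces these URLs have in common, whatever the scheme."""
    parts = urlsplit(url)
    return {
        "scheme": parts.scheme,
        "user": parts.username,
        "host": parts.hostname,
        "port": parts.port,  # None when the URL gives no explicit port
        "path": parts.path,
    }


for url in URLS:
    print(common_parts(url))
```

Only the scheme and the meaning of the path differ between protocols;
the user, host and port are read in exactly the same way each time.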

> > The clever design will arise when specialised features of various
> > protocols need to be specified whilst using the general interface,
> 
> But if you're specifying features specific to a certain protocol, why use 
> the general interface to do it? That makes the general interface hacky and 
> cluttered. My argument is that, right now, people use the general 
> interface (urllib) not because they don't know what type of URL they're 
> opening (ftp/http/file/etc) but because the modules somewhat discourage 
> using the other ones. 

But it's more convenient to use urllib to access resources in a
general way. One just doesn't need to know anything about the
protocol, and that's another powerful thing about urllib that I've
just remembered: provided SSL support is compiled into Python, using
"https" URLs is as simple as changing the URL given to urlopen.

Of course, one might want to control "https" connections more closely,
for example, but in many cases it is sufficient for urllib to do the
default thing.
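
As a rough illustration of "just open a URL and pretend that it's a
file", the sketch below assumes a current Python 3, where urlopen is
found in urllib.request. The function name read_resource is
hypothetical, and a "data" URL is used only so that the example runs
without a network connection.

```python
from urllib.request import urlopen


def read_resource(url):
    """Open any URL the standard openers understand and read it like a file."""
    with urlopen(url) as resource:
        return resource.read()


# A "data" URL keeps this example runnable without a network connection.
print(read_resource("data:text/plain,hello"))
```

Passing an "http", "https" or "ftp" URL instead requires no change to
read_resource, which is the convenience being argued for here.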

> I don't want to make it sound like too big of a deal, but the use model
> today doesn't make sense: today if a newbie wants the easiest way to open
> an HTTP URL, he should use urllib. If he wants to do something a little
> more complex, he should scrap the urllib code and use httplib. Like I
> said, it's not too big of a deal, but it makes more sense if moving from
> the simple to the more complex case is incremental and based on the same
> code.

It would be interesting to hear what kinds of things force you to deal
with httplib. I can imagine that there are features that are common to
most major protocols - authentication is the most significant example
- and I can imagine that incorporating a general interface to these
features is mostly possible.
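
For reference, the case described in the quoted text (an HTTP
connection plus extra headers plus some data to post) might look like
the sketch below. It assumes a current Python 3 and its
urllib.request.Request class; the function name build_post, the field
names and the X-Example header are invented for illustration, and
nothing is actually sent.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_post(url, fields, extra_headers):
    """Describe a POST with form data and extra headers, without sending it."""
    data = urlencode(fields).encode("ascii")
    return Request(url, data=data, headers=extra_headers)


request = build_post(
    "http://myserver:8080/docs/resource",
    {"name": "value"},    # hypothetical form field
    {"X-Example": "1"},   # hypothetical extra header
)
print(request.get_method(), request.full_url, request.data)
# Sending it would then be one more call: urllib.request.urlopen(request)
```

The request object is handed to the same urlopen that accepts a plain
URL string, so the step from the simple case to this one is
incremental rather than a rewrite.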

> > then there are plenty of other packages which deal with this kind of
> > problem; for example, the DB-API has ways of allowing database-specific
> > functionality to be specified when opening database connections.
> 
> Actually, this is a great example of what I'm saying we need! :) The DB 
> API does _not_ provide you a way to open an unknown database type, but a 
> common way to operate on a database connection once you have one. It would 
> be "powerful" if the DB API let you pass in a string that, among other 
> things, included "oracle", "gadfly", "mysql", etc. to denote database type 
> and it would then go connect to it, but such a feature wouldn't be very 
> *useful* because in practice you almost always *do* know what database 
> type you're connecting to.

Agreed. The JDBC approach of issuing a URL to the driver manager
(which is like what we're talking about here) indeed seems flamboyant
when you still have to specify which driver is actually going to be
used anyway.

> So, you use a specific database module (e.g. 
> DCOracle2) to get a database connection (to which you can pass all sorts 
> of custom information), after which you can use the connection in a 
> pretty generic way. At the same time, however, the connection object can 
> still expose additional vendor-specific functionality in addition to what 
> is specified in the DB API.

One of the problems with the DB-API as it is today, however, is that
it doesn't go far enough to hide vendor-specific differences.
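
A small sketch of both points, assuming the sqlite3 module shipped
with current Python versions stands in for a vendor-specific module
such as DCOracle2. The table name and the helper count_rows are made
up for the example.

```python
import sqlite3


def count_rows(connection, table):
    """Generic part: relies only on cursor(), execute() and fetchone()."""
    cursor = connection.cursor()
    # The table name is trusted input in this illustration.
    cursor.execute("SELECT COUNT(*) FROM " + table)
    return cursor.fetchone()[0]


# Specific part: the module is chosen by name and takes its own arguments.
connection = sqlite3.connect(":memory:")
cursor = connection.cursor()
cursor.execute("CREATE TABLE docs (name TEXT)")
# The "?" placeholder is sqlite3's paramstyle; other modules may require a
# different one, a vendor difference that the DB-API leaves visible.
cursor.execute("INSERT INTO docs VALUES (?)", ("resource",))
print(sqlite3.paramstyle, count_rows(connection, "docs"))
```

count_rows would work with any conforming connection, but the INSERT
statement might need rewriting for a module with another paramstyle.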

> A similar approach might work well for the different protocol libraries - 
> go to the appropriate module to open the one you want (setting it up with 
> any protocol-specific information), after which you have a file-like 
> object that your code can use generically. Note that on top of all this 
> somebody could still have the urllib functionality that takes a generic 
> URL, figures out the appropriate protocol, and returns the correct 
> "connection" object for your code to use, but such a top-level function 
> would *not* be the place to start adding protocol-specific options.

I believe that specialisation may be introduced into the objects
returned by whatever function was used to create them (even if the
creation was delegated to other functions, classes or modules). For
example, an object returned by the new-style urlopen which was given
an "http" URL may provide different additional information and methods
to an object created from an "ftp" URL.
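
One way this could look is sketched below. Everything in it is
hypothetical: the names Resource, HttpResource, FtpResource, OPENERS
and open_resource are invented, the classes are stand-ins that perform
no network access, and a current Python 3 is assumed.

```python
import io
from urllib.parse import urlsplit


class Resource(io.BytesIO):
    """Common file-like interface shared by every protocol (hypothetical)."""

    def __init__(self, url, body=b""):
        super().__init__(body)
        self.url = url


class HttpResource(Resource):
    """Adds HTTP-only details, here a dictionary of response headers."""

    def __init__(self, url, body=b"", headers=None):
        super().__init__(url, body)
        self.headers = dict(headers or {})


class FtpResource(Resource):
    """Adds FTP-only details, here the remote directory of the resource."""

    def __init__(self, url, body=b""):
        super().__init__(url, body)
        self.directory = urlsplit(url).path.rsplit("/", 1)[0] or "/"


# Stand-ins for real protocol modules; no network traffic takes place.
OPENERS = {"http": HttpResource, "https": HttpResource, "ftp": FtpResource}


def open_resource(url):
    """Pick the class for the URL's scheme and return a file-like object."""
    scheme = urlsplit(url).scheme
    try:
        opener = OPENERS[scheme]
    except KeyError:
        raise ValueError("unsupported URL scheme: %r" % scheme)
    return opener(url)


for url in ("http://myserver:8080/docs/resource",
            "ftp://myserver/pub/docs/resource"):
    resource = open_resource(url)
    print(type(resource).__name__, resource.read())
```

Code that only calls read() never needs to know which class it was
given, while code that cares can look for headers or directory on the
returned object.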

[...]

> With network protocols, however, there is much less overlap in both
> functionality and how you'd use them (and rightfully so since they are
> different protocols built to serve different purposes!) And that's the 
> whole reason why a generalized interface is nifty but less useful - the 
> protocols were built to do different things so trying to use them all the 
> same way is essentially dumbing them all down near the level of just a 
> file (and it's ok to have such a dumbing down function, but it's just not 
> the common case).

Again, I would say that many protocols lend themselves to a filesystem
type of view of the remote location. Perhaps this isn't always
appropriate, but it is an interesting abstraction which could form the
basis of an extended urllib. The very nature of URLs tends to suggest
a filesystem abstraction of network resources - that's why they were
invented, after all.

> I think I get your point, but I'd state it as "the key is to try to have
> each decreasingly common case build on the previous case" (so you don't
> have to relearn and you don't have to toss out work already done).

Indeed.

Paul