urllib, urllib2, httplib -- Begging for consolidation?

brueckd at tbye.com
Wed Jun 5 13:36:31 EDT 2002


On Wed, 5 Jun 2002, John J. Lee wrote:

> I'm trying to understand your posts on this,

Hi John,

Before I jump in I do want to emphasize that I'm not trying to be too 
critical of httplib/urllib/urllib2 - the OP asked if they should be 
consolidated, and my short response is "yes, and reorganized too". :-)

Anyway, my concerns live on two levels:

1) It's very non-obvious, especially to newcomers, which library should be
used when. A newbie will probably find httplib without too much
difficulty, but if what they need isn't there, why would they look in
urllib? (not to mention urllib2!) To me this is annoying and the cause of
some wheel-reinventing when they don't know about urllib (see the sketch
just below), but not too great a problem - it's a naming problem, it's
just a matter of learning, and it helps keep the Python book authors in
business. ;-)

2) As richer HTTP functionality is added to the standard library, it is 
getting added to the wrong place, such that it is only narrowly 
accessible. This is my main concern, which I'll try to elaborate on a 
little more.
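
To illustrate point 1, here's the sort of wheel-reinventing I mean (an
untested, off-the-top-of-my-head sketch - the host name is made up). The
newcomer who only ever finds httplib writes the first version, never
realizing the one-liner in urllib exists:

    # Fetch a page with httplib, the module the newbie finds first:
    import httplib
    conn = httplib.HTTPConnection('www.example.com')
    conn.request('GET', '/index.html')
    response = conn.getresponse()
    data = response.read()
    conn.close()

    # The same fetch, had they known to look in urllib:
    import urllib
    data = urllib.urlopen('http://www.example.com/index.html').read()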

urllib gives you a very generic facility to retrieve an object, and that
should not change too much: if you have no special needs then it works
great. As your needs require you to descend to lower level APIs (e.g. for
custom functionality), that transition should as much as possible be (1)  
straightforward, (2) obvious, and (3) one that allows you to reuse as much
of the related knowledge and work as possible. My assertion is that right
now we don't have those things.

Again, speaking ideally, facilities for the retrieval of objects could be 
organized into a hierarchy ranging from very high level to low level, 
e.g. (many branches and leaves removed):

urllib
 |
httplib ftplib gopherlib
 |
 ... (possibly several layers here)
 |
socket

So, depending on your specific needs, you'd "plug in" to this hierarchy at
the appropriate level. If all you need is to go fetch some object, you'd
probably go with urllib. If you know that you're using HTTP, you'd
probably use the highest level HTTP layer as your starting point. If you
were building a custom client that worked over a mythical HTTP/1.5 you'd
start even lower, on down until you're handling the sockets themselves.  
The key, though, is that at each level you'd have to reinvent as little as
possible - you'd build on related work from the lower levels.
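
And just to make the bottom of that picture concrete: if neither urllib
nor httplib gives you the control you need, "plugging in" today means
dropping all the way down to the socket module and reinventing the
protocol by hand. A rough, untested sketch (host and path made up; real
code would also have to parse the status line and headers, handle errors,
redirects, and so on):

    import socket

    # Hand-rolled HTTP GET at the socket level
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.connect(('www.example.com', 80))
    s.sendall('GET /index.html HTTP/1.0\r\nHost: www.example.com\r\n\r\n')

    chunks = []
    while 1:
        data = s.recv(8192)
        if not data:
            break
        chunks.append(data)
    s.close()

    # Status line, headers, and body come back as one unparsed blob
    response = ''.join(chunks)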

Obviously getting such a hierarchy to work right would be tough and not as
clean (there's not a strict parent-child relationship between all degrees 
of custom features so there'd probably be lots of mix-in class 
functionality instead of enforced layering). The approach we have today is 
*sort of* like this, except that richer and smarter functionality is being 
added at the *top* of the hierarchy. This has two negative side effects: 
(1) it pollutes the higher level, generic interfaces and (2) it makes it 
difficult to reuse functionality if your needs force you to start out at a 
lower level.

Problem #1 is what makes me throw my hands up in frustration when we talk
about, e.g., expanding the urllib APIs to have some way to do a HEAD
request: it doesn't belong on that level of functionality. That generic
interface is for making it trivially easy to fetch the contents associated
with a URL, independent as much as possible from protocol. By crufting it
up in an attempt to make it all things to all people, it is no longer a
clean, generic API. For this specific example, a HEAD request doesn't make
sense in terms of FTP, so you shouldn't be able to do a HEAD request using
the generic API. (An alternative would be to redesign the API to include
GetSize functionality, which would then take the appropriate
protocol-specific action under the covers, but the per-protocol smarts -
"how do I get the size of an FTP object?" - would certainly not live
inside the generic API module.) Again, in the example of a HEAD request,
in order to do anything useful with the results your code would have to
realize that it was in fact an HTTP HEAD response, so being able to call a
protocol-agnostic API to get those results doesn't add much value since
the calling code has to be aware of the protocol anyway.
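
For what it's worth, that GetSize alternative would look something
vaguely like the following - every name here is made up and it's untested
- but the point is just that the per-protocol smarts live *behind* the
generic call rather than leaking out into the caller:

    # Hypothetical protocol-agnostic "how big is this object?" call
    import urlparse, httplib, ftplib

    def get_size(url):
        scheme, host, path = urlparse.urlparse(url)[:3]
        if scheme == 'http':
            conn = httplib.HTTPConnection(host)
            conn.request('HEAD', path or '/')
            length = conn.getresponse().getheader('content-length')
            conn.close()
            return length and int(length)
        elif scheme == 'ftp':
            ftp = ftplib.FTP(host)
            ftp.login()                 # anonymous login
            size = ftp.size(path)       # FTP SIZE command; may be None
            ftp.quit()
            return size
        raise ValueError('no size support for %s URLs' % scheme)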

Previously in this thread I've focused my attention on problem #2 (reuse
of functionality), as that's the one that I've run into the most. As it
stands today, you benefit from all sorts of nifty HTTP behavior, but only
if what you need to do works through the urllib APIs. If specific needs
force you to descend that protocol/functionality tree (or if you simply
know in advance that you'll be dealing with a specific protocol), you
lose, and have to reimplement it all yourself. From a user's point of
view, this just doesn't make sense. Again using the most recent example of
an HTTP HEAD request, it seems odd that the standard library (urllib)  
happily follows redirects, but if I simply want to do a HEAD request (I
just need slightly more control over what's happening) I have to
implement that functionality myself.
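
Concretely, here's roughly what I end up hand-rolling today just to do a
HEAD with redirects followed (untested sketch, with no redirect-loop
detection or relative-Location handling - i.e. exactly the kind of thing
urllib already knows how to do, but only for GET):

    import urlparse, httplib

    def head(url, max_redirects=5):
        # Issue a HEAD request, following redirects by hand
        for _ in range(max_redirects):
            scheme, host, path = urlparse.urlparse(url)[:3]
            conn = httplib.HTTPConnection(host)
            conn.request('HEAD', path or '/')
            response = conn.getresponse()
            conn.close()
            if response.status in (301, 302, 303, 307):
                url = response.getheader('location')
                continue
            return response
        raise IOError('too many redirects')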

> Perhaps part of your point is simply that the division into modules should
> better reflect the conceptual structure?  The current organisation may (or
> may not) have been a mistake, but it's not a significant enough mistake to
> warrant reorganisation, is it?

Well, like I said before, this was all in response to the query "should
they be consolidated?", so IF we were to invest the time to consolidate,
THEN we should also reorganize. Also, it seems (this is very much my own
interpretation of course) that as these modules have evolved, the model
around which they are built has become fuzzy, or that they have outgrown
their original model altogether and the current implementation is showing
signs of stress. For example, the very existence of a module whose name is
"urllib2" says there's no longer a clear answer to the question "where
should new feature Y or class of features X live?". So, maybe nearly all
of what's there could be reused as is, but the decision of where to put
what and why would be driven by a clearer and cleaner big picture (I've
never advocated throwing it all away, for example, as there's lots of
really good stuff in there).

Probably 90% of the problem is naming and/or module organization - urllib, 
a module to retrieve the contents given a generic URL, should be just that 
and little else, and protocol-specific knowledge should be accessible from 
protocol-specific modules.

Have fun,
-Dave





