[Web-SIG] So what's missing?

Wed Oct 29 02:17:57 EST 2003

On Monday, October 27, 2003, at 09:00 AM, John J Lee wrote:
> On Sun, 26 Oct 2003, Ian Bicking wrote:
>> On Sunday, October 26, 2003, at 07:24 AM, John J Lee wrote:
> [...]
>> Essentially we'd just move HTTPBasicAuthHandler.http_error_401 into
>> HTTPHandler.  You could still override it, and HTTPBasicAuthHandler
>> would still override it (and somewhat differently, because
>> HTTPHandler.http_error_401 should handle both basic and digest auth).
>> It's a pretty small change, really.
>
> So is the benefit.  It's
>
> a = HTTPBasicAuthHandler()
> a.add_password(user="joe", password="joe")
> o = build_opener(a)
>
> vs.
>
> o = build_opener(HTTPHandler(user="joe", password="joe"))
>
>
> (assuming defaults for realm and uri -- BTW, there seems to be an
> HTTPPasswordMgrWithDefaultRealm already, which I guess is some way to
> what you want)

Yes, I just recently noticed that too.  Why it is implemented in a 
separate class I cannot fathom.
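
For reference, a minimal sketch of how that default-realm password 
manager plugs in.  It assumes Python 3's urllib.request, where the 
urllib2 classes ended up; the URL and credentials are made up.

```python
import urllib.request

# Hypothetical URL and credentials, for illustration only.
URL = "http://example.com/"

mgr = urllib.request.HTTPPasswordMgrWithDefaultRealm()
# realm=None means "use these credentials for any realm under this URL".
mgr.add_password(None, URL, "joe", "secret")

# The auth handler only consults the manager once a server answers 401.
opener = urllib.request.build_opener(urllib.request.HTTPBasicAuthHandler(mgr))

# Looking up an unknown realm falls back to the realm=None entry.
credentials = mgr.find_user_password("Some Realm", URL + "private/page")
print(credentials)
```

So the default realm is there, but you still need a manager, a handler 
and build_opener to get at it.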

> If we're still using build_opener, and HTTPBasicAuthHandler were to
> override HTTPHandler, it would have to be derived from it.  Not that a
> build_opener work-alike couldn't be devised, of course.
>
> [...]
>>> I'm still waiting for that example.
>>
>> I thought I gave examples: documentation, proliferation of classes,
>> non-orthogonality of features (e.g., HTTPS vs. HTTP isn't orthogonal
>> to authentication).
>
> Lack of documentation doesn't justify changes to the code.  There is
> not any harmful proliferation of classes, I think: the function of the
> handlers is pretty obvious in most cases (though obviously the docs
> could be better).  I don't recognize the orthogonality problem you're
> referring to.

I'm not as concerned with the internals, but rather the exposed 
interface.  This isn't a concern purely about lack of documentation 
either, but about the thoroughness and conciseness of that 
documentation.  A good interface lends itself to good documentation.  I 
don't think this interface can result in good documentation -- it will 
either be incomplete, difficult to navigate, or verbose (or all), as a 
reflection of the way in which internal implementation is exposed.

> [...]
>> urlopen('http://whatever.com',
>>      username='bob',
>>      password='secret',
>>      postFields={...},
>>      postFiles={'image': ('test.jpg', '... image body ...')},
>>      addHeaders={'User-Agent': 'superbot 3000'})
> [...]
>> write than any OO-based system.  I'm concerned about the external ease
>> of use, not the internal conceptual integrity.
>
> OK, maybe I'm overconcerned about this layer -- if it's a simple
> convenience thing like this, fine (as long as it actually is useful
> and simple, of course).
>
> My biggest concern was that you seemed to be advocating a new UserAgent
> class, which would presumably more-or-less duplicate OpenerDirector
> (you probably want to skip to the end of this post at this point,
> because I think you may have missed a crucial point about that class).
> OpenerDirector is not such a great name, actually: maybe UserAgent or
> URLOpener would have been better...
>
>>>> authentication information (and it doesn't obey browser URL
>>>> conventions, like http://user:password@domain/).
>>>
>>> What is that convention?  Is it standardised in an RFC?
>>
>> It's a URL convention that's been around a very long time, I don't
>> know if it is in an RFC.
>>
>>> I see
>>> ProxyHandler knows about that syntax.  Obviously it's not an
>>> intrinsic limitation of the handler system.
>>
>> I don't really know how a handler is chosen -- can it figure out
>> whether it should use HTTPHandler, HTTPBasicAuthHandler, or
>> HTTPDigestAuthHandler just from this URL?  Obviously basic vs. digest
>> can't be determined until you try to fetch the object.
>
> The user and password here are for the proxy, not the server (there's
> some code duplication here actually, but that's just a bug).  Dunno if
> that's standard use of that syntax.
>
>
> [...]
>>> Mind you, if your idea can do the same job as my RFE, then it should
>>> certainly be considered alongside that.
>>
>> Hmm... I just looked at the RFE now, so I'm still not sure what it
>> would mean to this.
>
> Sorry, I don't understand 'what it would mean to this'.  What's 'this'?

This discussion.

>>>> Yet none of these features
>>>> would be all that difficult to add via urlopen or perhaps other
>>>> simple functions (instead of via classes).  I don't think there's any
>>>> need for classes in the external API -- fetching URLs is about doing
>>>> things, not representing things, and functions are easier to
>>>> understand for doing.
>>>
>>> Details?  The only example you've given so far involved a UserAgent
>>> class.
>>
>> Details about what?  You're asking for details and examples, but I've
>> provided some already and I don't know what you're looking for.
>
> You provided some examples of features you think would require some
> kind of layer on top of urllib2.  I thought you were originally
> suggesting a new UserAgent class or similar (that was you, wasn't it?).
> I don't think that's necessary.

In the context of stateful HTTP requests, yes, I still think some 
object along the lines of a UserAgent is the best interface.

> But in the post I'm replying to here, you gave an example of adding
> args to urlopen.  I do agree that something like that could be useful.
> I think the docs should be changed here to make it clear that urlopen
> is just a convenience function that uses a global OpenerDirector.
>
> [...]
>>>> I think fetching and caching are two separate things.  The caching
>>>> requires a context.  The fetching doesn't.  I think fetching things
>>>
>>> The context is provided by the handler.
>>
>> But we're fetching URLs, not handlers.  The URL is context-less,
>> intrinsically.  The handler isn't context-less, but that's part of
>> what I don't like about urllib2's handler-oriented perspective.
>
> I don't understand what you just said, but I think we're agreed that
> something that doesn't require calling build_opener or
> OpenerDirector.add_handler could be convenient.

Okay, good.  That my statement was nonsensical was part of my point, 
but that's probably not a helpful way to make a point ;)
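
To make that concrete, here is a rough sketch of such a convenience 
layer.  It assumes Python 3's urllib.request rather than urllib2, and 
make_opener and fetch are hypothetical names of mine, not anything in 
the library.  The keyword arguments loosely follow the urlopen example 
quoted earlier.

```python
import urllib.request


def make_opener(url, username=None, password=None, add_headers=None):
    """Build an opener from plain keyword arguments (hypothetical helper)."""
    handlers = []
    if username is not None:
        mgr = urllib.request.HTTPPasswordMgrWithDefaultRealm()
        mgr.add_password(None, url, username, password)
        # Basic vs. digest is only known once the server answers 401,
        # so register both and let the response decide which one fires.
        handlers.append(urllib.request.HTTPBasicAuthHandler(mgr))
        handlers.append(urllib.request.HTTPDigestAuthHandler(mgr))
    opener = urllib.request.build_opener(*handlers)
    if add_headers:
        # This replaces the default header list, including User-agent.
        opener.addheaders = list(add_headers.items())
    return opener


def fetch(url, data=None, **options):
    """One call, no handler classes visible to the caller."""
    return make_opener(url, **options).open(url, data)


opener = make_opener("http://example.com/",
                     username="bob", password="secret",
                     add_headers={"User-Agent": "superbot 3000"})
print([type(h).__name__ for h in opener.handlers])
```

The handlers are all still there underneath; the caller just never has 
to name them.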

>>> [...]
>>>> I also don't see how caching would fit very well into the handler
>>>> structure.  Maybe there'd be a HTTPCachingHandler, and you'd
>>>> instantiate it with your caching policy? (where it stores files, how
>>>> many files, etc)  Also a HTTPBasicAuthCachingHandler,
>>>> HTTPDigestAuthCachingHandler, HTTPSCachingHandler, and so on?  This
>>>> caching is orthogonal -- not just to things like authentication, but
>>>
>>> My assumption was that it wasn't orthogonal, since RFC 2616 seems to
>>> have rather a lot to say on the subject.
>>
>> Well, if they aren't orthogonal, then they should all be implemented
>> in a single class.
>
> Yes.  Off the top of my head, I'd say something like (taking note of
> your point below about needing to actually cache responses as well as
> return cached data!):
>
> class AbstractHTTPCacheHandler:
>     def cached_open(self, request):
>         # return cached response, or None if no cache hit
>         return None
>     def cache(self, response):
>         # cache response if appropriate
>         pass
>
> class HTTPCacheHandler(AbstractHTTPCacheHandler):
>     http_open = AbstractHTTPCacheHandler.cached_open
>     http_response = AbstractHTTPCacheHandler.cache
>
> or, if you want a class that does both HTTP and HTTPS:
>
> class HTTPXCacheHandler(AbstractHTTPCacheHandler):
>     https_open = http_open = AbstractHTTPCacheHandler.cached_open
>     https_response = http_response = AbstractHTTPCacheHandler.cache
>
>
> [...]
>> Why not have just one good HTTP handler class?
>
> Why would you want one when you can easily do whatever you want with a
> convenience function or two, and / or a class derived from
> OpenerDirector, or something that works like build_opener, etc.?  Not so
> easy to go in the other direction, and separate out the various features
> of a big, all-singing all-dancing HTTP handler.  That was a big part of
> the motivation for urllib2 in the first place: inflexibility of urllib.

Why would I want two pieces if I could have one that can do both their 
jobs?  And why fold different ideas together into one notion of 
handler?  HTTP and HTTPS are almost exactly the same.  Basic and digest 
auth are almost exactly the same.  Using a cache and not using a cache 
are almost exactly the same.  All these details can be combined 
reliably in many ways, but the structure of handlers seems to get in 
the way.

But maybe this comes down to a disagreement about coding aesthetics.  I 
don't like inheritance, especially when it gets clever.  But if that's 
just an implementation detail, then eh... I can live.  It's when it 
gets exposed through the public interface (as it is in urllib2) that it 
bothers me.

[...]
>> 2 won't work, since CacheHandler can't
>> return None and let someone else do the work, because it has to know
>> about what the result is so that it can cache the result.
>
> At last, a real problem!  Actually, I think this is a problem already
> solved by my 'processors' idea, though perhaps not quite in its current
> form -- that should be easy to fix, though (ATM, IIRC, they're separate
> from handlers: you can't have an object that is both a handler and a
> processor -- and they don't currently have default_request and
> default_response methods, either).

The processors really sound like wrappers to me.
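
By wrapper I mean something like the sketch below: the cache sits 
around the fetch, so it sees both the request and the result.  The 
names caching and fake_fetch are hypothetical, and this is a naive 
in-memory cache that ignores everything RFC 2616 says about expiry and 
validation.

```python
def caching(fetch):
    """Wrap a fetch(url) callable with a naive in-memory cache."""
    cache = {}

    def cached_fetch(url):
        if url not in cache:
            # The wrapper gets the result back, so it can store it.  A
            # handler that returns None to defer never sees the result.
            cache[url] = fetch(url)
        return cache[url]

    return cached_fetch


calls = []


def fake_fetch(url):
    """Stand-in for a real network fetch; records each call."""
    calls.append(url)
    return "body of " + url


fetch = caching(fake_fetch)
first = fetch("http://a.example/")
second = fetch("http://a.example/")
print(len(calls))  # the underlying fetch ran once
```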

>> I missed that when you posted it.  That might handle some of these
>> features.  It seems a little too global to me.  For instance, how
>> would you handle two distinct user agents with respect to the referer
>> header?
>
> Two OpenerDirectors!
>
> new_opener = build_opener()
> new_opener.addheaders = [("User-agent", "Mozilla/5.0")]
>
> old_opener = build_opener()
> old_opener.addheaders = [("User-agent", "Mozilla/4.0")]
>
> new_opener.open("http://www.a.com/")
> old_opener.open("http://www.b.com/")

Okay, I didn't realize that.  That makes it much better, though the 
name OpenerDirector distracts.

>> Seems like it would also make sense as an OpenerDirector
>> subclass/wrapper.
>
> IIRC, there are issues with redirection that prevent that.

How so?  For instance, with referer, don't you essentially just want to 
do something like:

from urllib2 import OpenerDirector, Request

class RefererDirector(OpenerDirector):
     def __init__(self):
         OpenerDirector.__init__(self)
         self.last_url = ''  # URL of the previous response, if any
     def open(self, fullurl, data=None):
         if isinstance(fullurl, str):
             fullurl = Request(fullurl)
         if self.last_url:
             # send the previous URL as the Referer header
             fullurl.add_header('Referer', self.last_url)
         result = OpenerDirector.open(self, fullurl, data=data)
         # use the final URL, which may differ after redirects
         self.last_url = result.geturl()
         return result

This is essentially how a browser works, isn't it?  Does a header get 
lost somewhere?  If so, then that seems like a bug in the handler.
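
One way to check whether the header survives is to run the same idea 
against a stub handler.  This is only a sketch: it is ported to Python 
3's urllib.request, it uses the standard header name Referer, and 
StubHandler with its made-up "stub" scheme is hypothetical.  It says 
nothing about what happens across real HTTP redirects.

```python
import io
import urllib.request
import urllib.response


class RefererDirector(urllib.request.OpenerDirector):
    """Remember the last URL opened and send it as Referer next time."""

    def __init__(self):
        super().__init__()
        self.last_url = ''

    def open(self, fullurl, data=None, **kwargs):
        if isinstance(fullurl, str):
            fullurl = urllib.request.Request(fullurl)
        if self.last_url:
            fullurl.add_header('Referer', self.last_url)
        result = super().open(fullurl, data=data, **kwargs)
        self.last_url = result.geturl()
        return result


class StubHandler(urllib.request.BaseHandler):
    """Hypothetical handler for a made-up 'stub' scheme; no network."""

    def __init__(self):
        self.seen = []

    def stub_open(self, req):
        # Record the URL and the Referer header the handler received.
        self.seen.append((req.full_url, req.get_header('Referer')))
        return urllib.response.addinfourl(io.BytesIO(b''), {}, req.full_url)


handler = StubHandler()
director = RefererDirector()
director.add_handler(handler)
director.open('stub://host/first')
director.open('stub://host/second')
print(handler.seen)
```

In this setup the second request reaches the handler with the first 
URL as its Referer, so nothing is lost in the plain (non-redirect) 
path.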

--
Ian Bicking | ianb at colorstudy.com | http://blog.ianbicking.org