[Python-Dev] urllib2 EP + decr. startup time

KoDer koder.mail at gmail.com
Fri Feb 16 15:38:41 CET 2007


Hello to all.

For more than two years I have used urllib2 extensively to write
commercial applications (mostly for extracting data from web sites into
Excel sheets),
and here are some proposed enhancements for it:

1) Add support for 'HEAD' requests (and maybe some others).
This needs only small changes:
   a) Add a request_type='GET' parameter to the urllib2.Request class constructor.
   b) Pass the request_type value through into the HTTP request, except when
      the Request has data - in that case it changes to 'POST'.
The result of such a request is the same as for a 'GET' request,
except for the zero-size body.
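The constructor change in (a)/(b) could be sketched like this - a standalone
mock, not the real urllib2.Request; the request_type parameter is the proposed
addition, not an existing API:

```python
class Request(object):
    """Standalone sketch of the proposed urllib2.Request change.
    Only get_method() matters here; real header handling is omitted."""

    def __init__(self, url, data=None, request_type='GET'):
        self.url = url
        self.data = data
        self.request_type = request_type

    def get_method(self):
        # A request carrying data is still sent as POST, exactly as today;
        # otherwise the caller-supplied request_type is used.
        if self.data is not None:
            return 'POST'
        return self.request_type
```

With this, Request('http://example.com/', request_type='HEAD') would be sent
with the HEAD method, while passing data would still force POST.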

2) HTTP keep-alive opener. An almost complete implementation can be found
in urlgrabber (http://linux.duke.edu/projects/urlgrabber) (used by yum, so it
is tested well enough, I think). It uses the urllib2 opener protocol and
integrates well into the urllib2 structure. It needs only a small change to
properly support some headers.

3) Save the HTTP exchange history. Currently there is no convenient way to
obtain all sent and received headers. Received headers are saved only
for the last response in a redirection chain, and sent headers are not saved
at all. I use run-time patching of httplib to intercept the sent and received
data (maybe I missed something?). The proposal is to add a 'history'
property to the object returned from urllib2.urlopen - a list
of objects containing the sent/received headers for the whole redirect chain.
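What such a history could look like to the caller - all names below are
hypothetical, a sketch of the proposed interface only:

```python
class ExchangeRecord(object):
    """One hop of the redirect chain: what was sent and what came back.
    Hypothetical name - this only illustrates the proposal."""

    def __init__(self, url, sent_headers, received_headers, status):
        self.url = url
        self.sent_headers = sent_headers
        self.received_headers = received_headers
        self.status = status


# Simulated result of following one 301 redirect; under the proposal
# this list would live on the object returned by urllib2.urlopen.
history = [
    ExchangeRecord('http://example.com/old', {'Host': 'example.com'},
                   {'Location': 'http://example.com/new'}, 301),
    ExchangeRecord('http://example.com/new', {'Host': 'example.com'},
                   {'Content-Type': 'text/html'}, 200),
]
```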

4) Add the possibility to obtain the underlying socket used to receive HTTP
data. Currently it is impossible to work with an HTTP connection in an
asynchronous manner (or did I miss something again?).
If the connection hangs, the whole program hangs too, and I don't know
a way to fix this.
Of course, if you obtain such a socket then you become responsible for
decompression etc.
Now I use the following code:
x = urllib2.urlopen(.....)
sock = x.fp._sock.fp._sock
The only remaining problem, as far as I know, is chunked encoding. In the
case of chunked encoding we need to return a socket-like object which
does all the work of reassembling the chunks into the original stream. I have
already used such an object for two years and it works fine.
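A minimal sketch of such a dechunking wrapper - demonstrated here on an
in-memory stream instead of a real socket, with trailer handling omitted:

```python
import io

class DechunkingStream(object):
    """File-like wrapper that reassembles HTTP chunked transfer encoding
    into the original byte stream.  'raw' is any file-like object (e.g.
    a socket makefile(); io.BytesIO here for the demo)."""

    def __init__(self, raw):
        self.raw = raw

    def read_all(self):
        body = b''
        while True:
            # Each chunk starts with its size in hex (optionally followed
            # by ';extensions'), terminated by CRLF.
            size_line = self.raw.readline().split(b';')[0].strip()
            size = int(size_line, 16)
            if size == 0:
                break  # last chunk; any trailers would follow
            body += self.raw.read(size)
            self.raw.readline()  # consume the CRLF after the chunk data
        return body


wire = io.BytesIO(b'4\r\nWiki\r\n5\r\npedia\r\n0\r\n\r\n')
data = DechunkingStream(wire).read_all()
```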

5) And now for something completely different ;)).

   This is just an initial proposal and it needs refinement. Maybe I
should put it on the python-ideas list?

   At the last Google SoC, one of the problems to solve was decreasing
the interpreter's startup time.
   The 'strace' command shows the following: the interpreter spends most of
its startup time trying to find imported modules.
   Most of these lookups finish with a 'not found' error, because of the
large size of the sys.path variable.
   In the future this time will increase - setuptools adds many dirs to
the search path
   using pth files (to manage installed modules and eggs).

   I propose to add something like the .so caching which modern
*nix systems use to load
   shared libraries.

   a) Add a --build-modules-index option to the python interpreter. When
python finds this option it scans all dirs on the path and builds a dictionary
{module_name: module_path}.
   The dict is saved to an external file (saving only the top dir for packages
and the path for single-file modules).
   It also saves in this file the mtimes of all pth files and dirs from
the path, and the path variable itself.

   b) When the interpreter starts up it scans, as usual, all path
dirs for pth files.
   Then it reads the cache file and compares the saved mtimes and path dirs
with the current ones, checking whether new modules or search dirs
were added or old ones modified.
   If nothing was modified, the cache data is used for fast module loading.
If an imported module isn't found in the
   cache, the interpreter falls back to the standard scheme.

   Also it is necessary to add some options to control cache usage,
like --disable-cache,
   --clear-cache, disabling caching for some dirs, etc.
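
   The index-building step in (a) and the staleness check in (b) could be
sketched as follows (function and key names are mine; a real implementation
would also serialize the dict to a file and handle pth files):

```python
import os
import tempfile

def build_index(search_dirs):
    """Map top-level module names to their location, and remember each
    dir's mtime so staleness can be detected on the next startup."""
    index, mtimes = {}, {}
    for d in search_dirs:
        if not os.path.isdir(d):
            continue
        mtimes[d] = os.path.getmtime(d)
        for entry in os.listdir(d):
            path = os.path.join(d, entry)
            if entry.endswith('.py'):
                # single-file module: store the file's path
                index.setdefault(entry[:-3], path)
            elif (os.path.isdir(path) and
                  os.path.exists(os.path.join(path, '__init__.py'))):
                # package: store only the top dir
                index.setdefault(entry, path)
    return {'index': index, 'mtimes': mtimes}

def cache_is_valid(cache):
    """On startup: compare saved dir mtimes with the current ones; any
    mismatch means fall back to the standard search scheme."""
    return all(os.path.isdir(d) and os.path.getmtime(d) == t
               for d, t in cache['mtimes'].items())

# Demo over a throwaway directory containing one single-file module.
demo_dir = tempfile.mkdtemp()
open(os.path.join(demo_dir, 'mymod.py'), 'w').close()
cache = build_index([demo_dir])
```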
---
K.Danilov aka KoDer

