Converting relative URLs to absolute

Paul Boddie paul at boddie.net
Wed Mar 13 06:31:03 EST 2002


Fernando Pérez <fperez528 at yahoo.com> wrote in message news:<a6mpda$79e$1 at peabody.colorado.edu>...
> James A Roush wrote:
> 
> > Does anyone have any code that, given the absolute URL of a web page, can
> > convert all the relative URLs on that page to their absolute equivalents?
> 
> Assuming absolute is a string and relative_list a list of strings, the
> following comes to mind:
> 
> [absolute+'/'+relative for relative in relative_list]
> 
> Maybe you wanted something fancier, don't know.

I suppose it would be more appropriate to handle "back references"
(the ".." path segments) and to detect any "base" elements as well.
For example, given the following "base"...

  http://www.python.org/invented/framework/demo/

...and the following URLs...

  moreinfo.html
  docs.html
  ../apps.html
  ../../stuff.html
  /index.html
  http://www.zope.org

...one would want to remove certain parts of the "base" before
concatenating the relative URLs to it. Thus, we would produce these
absolute URLs:

  http://www.python.org/invented/framework/demo/moreinfo.html
  http://www.python.org/invented/framework/demo/docs.html
  http://www.python.org/invented/framework/apps.html
  http://www.python.org/invented/stuff.html
  http://www.python.org/index.html
  http://www.zope.org
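
The trimming described above can be sketched by hand. This is only an
illustration under stated assumptions: the base is well formed with a
scheme and a host, and query strings, fragments and scheme-relative
URLs are ignored. The name resolve is hypothetical and is not taken
from urllib or any other library.

```python
def resolve(base, relative):
    """Resolve a relative URL against a base by trimming the base path.

    Hypothetical helper for illustration only; it does not handle query
    strings, fragments or scheme-relative ("//host/...") URLs.
    """
    # Already absolute: leave it alone.
    if "://" in relative:
        return relative

    scheme, rest = base.split("://", 1)
    host, _, path = rest.partition("/")

    # Root-relative: keep only the scheme and host of the base.
    if relative.startswith("/"):
        return scheme + "://" + host + relative

    # Drop the last segment of the base path (the "file" part, which is
    # empty when the base ends with a slash).
    segments = path.split("/")[:-1]

    for part in relative.split("/"):
        if part == "..":
            # A back reference removes one directory, if any are left.
            if segments:
                segments.pop()
        elif part != ".":
            segments.append(part)

    return scheme + "://" + host + "/" + "/".join(segments)


base = "http://www.python.org/invented/framework/demo/"
for url in ["moreinfo.html", "../apps.html", "../../stuff.html",
            "/index.html", "http://www.zope.org"]:
    print(resolve(base, url))
```

Each ".." removes one directory from the end of the base path, which
is how ../../stuff.html ends up directly under /invented/.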

I'm not so sure that urllib supports such operations, at least not in
any version of it that I have (from Python 2.0 or 2.1). Instead,
there are some fairly low-level split operations which aren't especially
useful in this case. In addition, you might need to use an HTML parser
to get hold of any "base" elements in the page.
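
For what it's worth, current Python 3 does provide
urllib.parse.urljoin, which appears to cover the cases listed earlier,
and html.parser can be used to find a "base" element. The following is
a minimal sketch assuming Python 3 rather than the 2.0/2.1 releases
mentioned above. LinkCollector and absolute_links are hypothetical
names, and the set of tags inspected is an arbitrary choice, not a
complete list.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the first <base href> and any href/src attribute values."""

    def __init__(self):
        super().__init__()
        self.base_href = None
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "base":
            # Only the first "base" element with an href is used.
            if self.base_href is None and attrs.get("href"):
                self.base_href = attrs["href"]
        elif tag in ("a", "link", "area") and attrs.get("href"):
            self.links.append(attrs["href"])
        elif tag in ("img", "script", "frame", "iframe") and attrs.get("src"):
            self.links.append(attrs["src"])


def absolute_links(page_url, html):
    """Return the page's links as absolute URLs, honouring any <base>."""
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    # A "base" element, if present, overrides the page's own URL.
    if collector.base_href:
        base = urljoin(page_url, collector.base_href)
    else:
        base = page_url
    return [urljoin(base, link) for link in collector.links]


page = """<html><head>
<base href="http://www.python.org/invented/framework/demo/">
</head><body>
<a href="../apps.html">apps</a>
<img src="/logo.png">
<a href="http://www.zope.org">zope</a>
</body></html>"""

print(absolute_links("http://example.com/page.html", page))
```

The links are resolved only after the whole page has been parsed, so
the "base" element is picked up even if it appears after some links.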

I've written some page-mining tools which help with these kinds of
activities, and I suppose I should get round to releasing them at some
point. Let me know if you're interested!

Paul


