Spider - path conflict [../test.htm,www.nic.nl/index.html]

martijn at gamecreators.nl
Fri Apr 1 07:08:14 EST 2005


Hi!

I thought I was finished with my own spider...
But then I found a bug, or in other words a missing part in my code.

I forgot that people write links like these in website HTML:
<a href="http://www.nic.nl/monkey.html">is OK</a>
<a href="../monkey.html">error</a>
<a href="../../monkey.html">error</a>

So now I'm trying to fix my spider, but it keeps failing.
I tried something like this:

------------------------
import urlparse
import string

def fixPath(urlpath,deep):
    path=''
    test = urlpath.split('/',deep)
    for this in test:
        if this<>'' and this.count('.')==0:
            path=path+'/'+this
    return path

def fixUrl2(src,url):
    url = urlparse.urlparse('http://'+url)
    src = urlparse.urlparse('http://'+src)

    if url[2]:
        thepath = fixPath(url[2],url[2].count('/')-(src[2].count('/')))

    if src[1] == '..':
        if url[1]<>'':
            theurl = url[1]+''+thepath+''+src[2].replace('../','')

    print theurl

fixUrl2('../monkey2.html','www.nic.nl/test/info/monkey1.html')
fixUrl2('../../monkey2.html','www.nic.nl/test/info/monkey1.html')
fixUrl2('../monkey2.html','www.nic.nl/info/monkey1.html')
------------------------

Usage:
fixUrl2('a newly found link', 'the page it was found in')

I hope someone knows a robust, working way to do this.
Thanks a lot,
GC-Martijn
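
One possible direction, offered as a sketch rather than a definitive answer: the standard library function urljoin (in the urlparse module on Python 2, in urllib.parse on Python 3) resolves relative links such as ../monkey.html against the URL of the page they were found on. The helper name resolve_link and the assumption that scheme-less pages are served over http are illustrative and do not come from the code above.

```python
# Sketch only: resolve_link is a hypothetical helper, not part of the original post.
try:
    from urllib.parse import urljoin   # Python 3
except ImportError:
    from urlparse import urljoin       # Python 2

def resolve_link(link, page):
    """Return the absolute URL for `link` as found on `page`."""
    # The post passes pages without a scheme; urljoin needs one,
    # so http is assumed here.
    if '://' not in page:
        page = 'http://' + page
    # urljoin handles '../' segments and leaves absolute links untouched.
    return urljoin(page, link)

print(resolve_link('../monkey2.html', 'www.nic.nl/test/info/monkey1.html'))
# http://www.nic.nl/test/monkey2.html
print(resolve_link('../../monkey2.html', 'www.nic.nl/test/info/monkey1.html'))
# http://www.nic.nl/monkey2.html
print(resolve_link('../monkey2.html', 'www.nic.nl/info/monkey1.html'))
# http://www.nic.nl/monkey2.html
print(resolve_link('http://www.nic.nl/monkey.html', 'www.nic.nl/info/monkey1.html'))
# http://www.nic.nl/monkey.html
```

The argument order mirrors fixUrl2: the newly found link first, then the page it was found in.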



