webspider, regexp not working, why?

notnorwegian at yahoo.se notnorwegian at yahoo.se
Fri May 23 12:42:57 EDT 2008


url = re.compile(
    r"^((ht|f)tp(s?)\:\/\/|~/|/)?([\w]+:\w+@)?"
    r"([a-zA-Z]{1}([\w\-]+\.)+([\w]{2,5}))(:[\d]{1,5})?"
    r"((/?\w+/)+|/?)(\w+\.[\w]{3,4})?"
    r"((\?\w+=\w+)?(&\w+=\w+)*)?")

Why isn't this regexp catching something like:

<link rel="alternate" type="application/rss+xml" title="Python Screencasts"
        href="http://www.showmedo.com/latestVideoFeed/rss2.0?tag=python" />

import urllib

site = urllib.urlopen("http://www.python.org")
for row in site:
    obj = url.search(row)
    if obj is not None:
        print "url: ", obj.group()

I know the regexp works, because it catches www.hello.com in a text
file, and I can catch email addresses on websites with another regexp.

search and match yield the same results.

But when you put something like href= in front of the URL, it doesn't work.

I see now that it has to match at the beginning of the row or something,
because:
hi www.google.com
doesn't match, but
www.google.com  hi
matches.
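
A minimal sketch that reproduces this, assuming the leading "^" in the pattern is what ties it to the start of the row. The names URL_BODY, anchored, unanchored and first_hit are made up for this illustration; the pattern is otherwise the one posted above.

```python
import re

# The pattern from the post, minus its leading "^" (URL_BODY is a made-up name).
URL_BODY = (
    r"((ht|f)tp(s?)\:\/\/|~/|/)?([\w]+:\w+@)?"
    r"([a-zA-Z]{1}([\w\-]+\.)+([\w]{2,5}))(:[\d]{1,5})?"
    r"((/?\w+/)+|/?)(\w+\.[\w]{3,4})?"
    r"((\?\w+=\w+)?(&\w+=\w+)*)?"
)

anchored = re.compile("^" + URL_BODY)  # as posted: can only match at the start of the row
unanchored = re.compile(URL_BODY)      # assumption: same pattern with the "^" dropped

rows = [
    "www.google.com  hi",
    "hi www.google.com",
    '        href="http://www.showmedo.com/latestVideoFeed/rss2.0?tag=python" />',
]


def first_hit(pattern, row):
    """Return the text of the first match in row, or None if there is none."""
    m = pattern.search(row)
    return m.group() if m is not None else None


for row in rows:
    print("%r -> anchored: %r, unanchored: %r"
          % (row, first_hit(anchored, row), first_hit(unanchored, row)))
```

With the "^" in place, only the first row gives a hit. Without it, all three rows do, although the pattern as written may still stop short of the full URL on the href row, so the rest of the regexp probably needs work too.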


I thought a regexp would search a row/file and report an occurrence
wherever it finds one, so a regexp of "lo" would match in lopez.
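
A small sketch of that expectation, using only the standard re module. The sample string "hello lopez" is made up for illustration. It also suggests why search and match gave the same results above: a pattern that starts with "^" is pinned to the start of the string either way.

```python
import re

sample = "hello lopez"  # made-up test string

# search scans the whole string, so "lo" is found even though it is not at the start.
found_anywhere = re.search("lo", sample) is not None

# match only ever tries position 0, so it fails here.
found_at_start = re.match("lo", sample) is not None

# A leading "^" makes search behave like match for a single-line string.
found_with_caret = re.search("^lo", sample) is not None

# finditer reports every occurrence together with its position.
hits = [(m.start(), m.group()) for m in re.finditer("lo", sample)]

print(found_anywhere, found_at_start, found_with_caret)
print(hits)
```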


