need some debug-infos on a simple regex

Martin Kaspar martin.kaspar at campus-24.com
Fri Nov 12 20:21:04 EST 2010


hello dear list!

I'm very new to programming and am teaching myself. I'm having a
problem with a little project.

I'm trying to perform a fetch process, but every time I try it I run
into errors.
I have read the Python documentation for more than ten hours now, and I
have several books here,
but they do not help at the moment. The following code runs like a charm:


import urllib
import urlparse
import re

url = "http://search.cpan.org/author/?W"
html = urllib.urlopen(url).read()
for lk, capname, name in re.findall('<a href="(/~.*?/)"><b>(.*?)</b></a><br/><small>(.*?)</small>', html):
    alk = urlparse.urljoin(url, lk)

    data = { 'url':alk, 'name':name, 'cname':capname }

    phtml = urllib.urlopen(alk).read()
    memail = re.search('<a href="mailto:(.*?)">', phtml)
    if memail:
        data['email'] = memail.group(1)

    print data

Note that the code above runs very well. All is nice. Now I
want to apply it to a new target. I can learn a lot from this. Let us
say this Swiss site, educa.ch.

The aim: I want to adapt it to a new target to learn more about
regex and to do some homework (I work as a teacher and am
collecting some data about colleagues). How should we fetch the pages?
That is the problem. I want to learn while applying the
code. What is necessary to apply the example to the target?

the target:  http://www.educa.ch/dyn/79362.asp?action=search

But the code (see below) does not run. I tried several things to
debug it. Can you help me?
BTW, should I fetch the pages and load them into an array, or should I
loop over these URLs:

http://www.educa.ch/dyn/79376.asp?id=2635
http://www.educa.ch/dyn/79376.asp?id=3493
and so on...
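
On the array-versus-loop question, one possible approach is sketched here in Python 3 (the scripts in this post are Python 2). It builds each detail URL from an id and processes the pages one at a time, keeping only a small record per page instead of the whole HTML. The names `detail_urls`, `collect`, and `fetch` are hypothetical, and it is an assumption that every detail page follows the `79376.asp?id=N` form of the two URLs listed here.

```python
from urllib.parse import urljoin

BASE = "http://www.educa.ch/dyn/"


def detail_urls(ids, base=BASE):
    """Yield one detail-page URL per id (assumed form: 79376.asp?id=N)."""
    for page_id in ids:
        yield urljoin(base, "79376.asp?id=%d" % page_id)


def collect(ids, fetch):
    """Visit the pages one by one and keep a small record for each.

    `fetch` is any callable that takes a URL and returns the page text,
    for example a thin wrapper around urllib.request.urlopen.
    """
    records = []
    for page_url in detail_urls(ids):
        html = fetch(page_url)
        records.append({"url": page_url, "size": len(html)})
    return records
```

Passing `fetch` in as a parameter means the loop can be tried with a dummy function first, without hitting the site on every test run.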

Here is the code that does not work:

import urllib
import urlparse
import re

url = "http://www.educa.ch/dyn/"
html = urllib.urlopen("http://www.educa.ch/dyn/79362.asp?action=search").read()
for capname, lk in re.findall('<a name="\d+"></a><br><img [^>]+>([^<]+).*?<a href="#\d+" onclick="javascript: window.open\(\'(\d+.asp?id=\d+)\'', html):
    alk = urlparse.urljoin(url, lk)

    data = { 'url':alk, 'cname':capname }

    phtml = urllib.urlopen(alk).read()
    memail = re.search('<a href="mailto.*?)">', phtml)
    if memail:
        data['email'] = memail.group(1)

    print data
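
Two things in the patterns above can be checked in isolation. This is a minimal sketch in Python 3, not a confirmed diagnosis. `SAMPLE` is a made-up fragment shaped like what the pattern expects; the real educa.ch markup may differ. The `re.DOTALL` flag is an addition that is not in the posted code: it lets `.*?` run across line breaks. The helper `compiles` is hypothetical.

```python
import re

# Hypothetical fragment shaped like what the posted pattern expects;
# the real educa.ch markup may differ.
SAMPLE = (
    '<a name="2635"></a><br><img src="dot.gif">Example School<br>\n'
    '<a href="#2635" onclick="javascript: window.open(\'79376.asp?id=2635\')">more</a>'
)

# Shared first part of the pattern, unchanged from the post.
HEAD = r'<a name="\d+"></a><br><img [^>]+>([^<]+).*?<a href="#\d+" onclick="javascript: window.open\('

# As posted: "." matches any character and "p?" means "an optional p",
# so the literal text ".asp?id=" is never matched.
AS_POSTED = re.compile(HEAD + r"'(\d+.asp?id=\d+)'", re.DOTALL)

# Escaped: "\." and "\?" match the literal characters.
ESCAPED = re.compile(HEAD + r"'(\d+\.asp\?id=\d+)'", re.DOTALL)


def compiles(pattern):
    """Return True if the pattern is a valid regular expression."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False


print(AS_POSTED.findall(SAMPLE))
print(ESCAPED.findall(SAMPLE))
print(compiles(r'<a href="mailto.*?)">'))    # pattern as posted
print(compiles(r'<a href="mailto:(.*?)">'))  # pattern from the working script
```

On this sample the posted link pattern finds nothing, while the escaped one returns the name and the relative link. The `mailto` pattern as posted has a closing parenthesis without an opening one, so it does not compile; the form from the first script does.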

I look forward to getting some starting points.

thx  matze



