[Tutor] Website Retrieval Program

Jacob S. keridee at jayco.net
Sat Sep 17 00:49:21 CEST 2005


> Daniel Watkins wrote:
>
>>I'm currently trying to write a script that will get all the files
>>necessary for a webpage to display correctly, followed by all the
>>intra-site pages and such forth, in order to try and retrieve one of the
>>many sites I have got jumbled up on my webspace. After starting the
>>writing, someone introduced me to wget, but I'm continuing this because
>>it seems like fun (and that statement is the first step on a slippery
>>slope :P).
>>
>>My script thus far reads:
>>"""
>>import re
>>import urllib
>>
>>source = urllib.urlopen("http://www.aehof.org.uk/real_index2.html")
>>next = re.findall('src=".*html"',source.read())
>>print next
>>"""
>>
>>This returns the following:
>>"['src="nothing_left.html"', 'src="testindex.html"',
>>'src="nothing_right.html"']"
>>
>>This is a good start (and it took me long enough! :P), but, ideally, the
>>re would strip out the 'src=' as well. Does anybody with more re-fu than
>>me know how I could do that?
>>
>>Incidentally, feel free to use that page as an example. In addition, I
>>am aware that this will need to be adjusted and expanded later on, but
>>it's a start.
>>
>>Thanks in advance,
>>Dan

Orri wrote:
> Well, you don't necessarily need re-fu to do that:

Ah, but since he has already integrated re into the mix, perhaps it would be
simpler just to make the small re adaptation. Frankly, (gasp) I'm surprised
that no one pointed it out.

import re
import urllib

source = urllib.urlopen("http://www.aehof.org.uk/real_index2.html")
next = re.findall('src=(".*html")',source.read())
print next

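Notice the parentheses added in the pattern. They form a group, and when a
pattern contains a single group, re.findall returns only the text matched by
that group rather than the whole match, so the 'src=' is dropped.
If you do not want the quotes in the final outcome, move the parentheses
inside the quotes in the pattern: 'src="(.*html)"'.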
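
To try both variants without fetching the page, here is a small sketch that
runs the two patterns against a made-up string. The frame tags in `html` are
my assumption about what the page roughly looks like, not its real source.
The last two lines show a caveat: `.*` is greedy, so if two src attributes
ever share a line, the non-greedy `.*?` is probably what you want.

```python
import re

# Made-up sample standing in for source.read(); the real page may differ.
html = '''<frame src="nothing_left.html">
<frame src="testindex.html">
<frame src="nothing_right.html">'''

# Group around the quotes: results keep the quotes.
with_quotes = re.findall('src=(".*html")', html)

# Group inside the quotes: results are the bare file names.
without_quotes = re.findall('src="(.*html)"', html)

print(with_quotes)
print(without_quotes)

# Two attributes on one line: greedy .* runs through to the last 'html"',
# while non-greedy .*? stops at the first one.
one_line = '<frame src="a.html"><frame src="b.html">'
greedy = re.findall('src="(.*html)"', one_line)
non_greedy = re.findall('src="(.*?html)"', one_line)
```

With the sample above, `without_quotes` holds the three file names with no
'src=' and no quotes, and `non_greedy` holds 'a.html' and 'b.html' as separate
entries.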

Jacob



> import re
> import urllib
>
> source = urllib.urlopen("http://www.aehof.org.uk/real_index2.html")
> next = [i[4:] for i in re.findall('src=".*html"',source.read())]
> print next
>
> gives you
>
> ['"nothing_left.html"', '"testindex.html"', '"nothing_right.html"']
>
> And if you wanted it to specifically take out "src=" only, I'm sure you
> could tailor it to do something with i[i.index(...):] instead.


