downloading from links within a webpage

Dave Angel davea at davea.name
Tue Oct 14 13:47:20 EDT 2014


Shiva <shivaji_tn at yahoo.com.dmarc.invalid> wrote in message:
> Hi,
> 
> Here is a small piece of code I wrote that downloads images from a
> specified webpage URL (you can limit how many downloads you want).
> However, I am looking at adding functionality to search external links
> from this page and download the same number of images from those pages
> as well (and limiting the depth it can go to).
> 
> Any ideas?  (I am using Python 3.4 & I am a beginner)
> 
> import urllib.request
> import re
> url="http://www.abc.com"
> 
> pagehtml = urllib.request.urlopen(url)
> myfile = pagehtml.read()
> matches = re.findall(r'http://\S+?\.(?:jpg|jpeg)', str(myfile))
> 
> 
> for urltodownload in matches[0:50]:
>     imagename = urltodownload[-12:]
>     urllib.request.urlretrieve(urltodownload, imagename)
> 
> print('Done!')
>  
> Thanks,
> Shiva
> 
> 

I'm going to make the wild assumption that you can safely do both
parses using regexes, and that finding the jpegs works well enough
with your present pattern.

The first thing is to turn most of your present code into a function,
fetch(), starting with the pagehtml line and ending just before the
print. The function should take two arguments, url and depthlimit.

Now, at the end of the function, add something like:

    if depthlimit > 0:
        matches = ...  # some regex that finds links
        for link in matches[:40]:
            fetch(link, depthlimit - 1)
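
For the "some regex that finds links" part, something crude along
these lines may do for a first pass (the pattern is purely
illustrative and only handles double-quoted href attributes; a regex
is no substitute for a real HTML parser, but we've already assumed
regexes are acceptable here):

    # hypothetical link pattern: grabs absolute http URLs out of
    # href="..." attributes; relative links and https are ignored
    matches = re.findall(r'href="(http://\S+?)"', str(myfile))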

Naturally, the rest of the top-level code needs to be moved after
the function definition, and the whole thing kicked off with
something like:

    fetch(url, 10)

to get a depth limit of 10.
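
Putting all of that together, an untested sketch (the jpeg pattern is
yours, the link pattern is the illustrative one from above, and the
50/40 caps and the depth of 10 are just the numbers we've been using):

    import urllib.request
    import re

    def fetch(url, depthlimit):
        # the original download code, wrapped up as a function
        pagehtml = urllib.request.urlopen(url)
        myfile = pagehtml.read()
        text = str(myfile)

        matches = re.findall(r'http://\S+?\.(?:jpg|jpeg)', text)
        for urltodownload in matches[0:50]:
            imagename = urltodownload[-12:]
            try:
                urllib.request.urlretrieve(urltodownload, imagename)
            except OSError:
                pass  # some fetches are bound to fail; skip them

        # then recurse into links found on this page
        if depthlimit > 0:
            links = re.findall(r'href="(http://\S+?)"', text)
            for link in links[:40]:
                fetch(link, depthlimit - 1)

    url = "http://www.abc.com"
    fetch(url, 10)
    print('Done!')

You'd probably also want to keep a set of already-visited URLs so
the same page isn't fetched repeatedly, but that's a refinement.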
            
-- 
DaveA



