Python script help
Piet van Oostrum
piet at vanoostrum.org
Fri Aug 23 22:37:14 EDT 2013
cool1574 at gmail.com writes:
> Here are some scripts, how do I put them together to create the script
> I want? (to search an online document and download all the links in it)
> p.s: can I set a destination folder for the downloads?
You can use os.chdir to go to the desired folder.
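For example (a minimal sketch; the folder name 'downloads' is just an
illustration):

```python
import os

# Create the destination folder if it doesn't exist yet, then make it
# the current working directory, so plain open(filename, 'wb') calls
# write into it.
dest = 'downloads'  # hypothetical folder name
if not os.path.isdir(dest):
    os.makedirs(dest)
os.chdir(dest)
```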
>
> urllib.urlopen("http://....")
>
> possible_urls = re.findall(r'\S+:\S+', text)
>
> import urllib2
> response = urllib2.urlopen('http://www.example.com/')
> html = response.read()
If you insist on not using wget, here is a simple script with
BeautifulSoup (v4):
########################################################################
from bs4 import BeautifulSoup
from urllib2 import urlopen
from urlparse import urljoin
import os
import re

os.chdir('OUT')

def generate_filename(url):
    # Strip the scheme (e.g. "http://") and flatten the rest into a
    # single filename.
    url = re.sub('^[a-zA-Z0-9+.-]+:/*', '', url)
    return url.replace('/', '_')

URL = "http://www.example.com/"

soup = BeautifulSoup(urlopen(URL).read())
links = soup.select('a[href]')
for link in links:
    # Resolve relative links against the page URL.
    url = urljoin(URL, link['href'])
    print url
    html = urlopen(url).read()
    fn = generate_filename(url)
    with open(fn, 'wb') as outfile:
        outfile.write(html)
########################################################################
You should add a more intelligent filename generator, filter out mailto:
URLs (and possibly other non-downloadable schemes), and add exception
handling for HTTP errors.
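The scheme filter could be sketched like this (the helper name and the
set of accepted schemes are my own choices; the import dance makes it
work on both Python 2 and 3):

```python
try:
    from urllib.parse import urlparse  # Python 3
except ImportError:
    from urlparse import urlparse      # Python 2

def is_downloadable(url):
    # Only fetch URLs whose scheme urlopen can actually handle;
    # this skips mailto:, javascript:, etc.
    return urlparse(url).scheme in ('http', 'https', 'ftp')
```

In the download loop you would then skip any link for which
is_downloadable(url) is false, and wrap the urlopen call in a
try/except for urllib2.HTTPError (urllib.error.HTTPError on Python 3).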
--
Piet van Oostrum <piet at vanoostrum.org>
WWW: http://pietvanoostrum.com/
PGP key: [8DAE142BE17999C4]