how to count and extract images
Mike Meyer
mwm at mired.org
Sun Oct 23 20:55:21 EDT 2005
Joe <dinamo99 at lycos.com> writes:
> start = s.find('<a href="somefile') + len('<a href="somefile')
> stop = s.find('">Save File</a></B>', start)
> fileName = s[start:stop]
> and then construct the url with the filename to download the image,
> which works fine because every image has the Save File link and I can
> count the number of images easily. The problem is when there is more
> than one image: I try using a while loop to download the files, and it
> works fine for the first one but always matches the same one. How can
> I count the matches and tell the loop to skip the first one if it has
> been downloaded, go to the next one, and so on?
To answer your question, use the optional start argument to find, so
each search resumes where the previous match ended:

    stop = 0
    while stop >= 0:
        start = s.find('<a href="somefile', stop) + len('<a href="somefile')
        stop = s.find('">Save File</a></B>', start)
        if stop >= 0:
            fileName = s[start:stop]
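To see the whole loop in action, here is a minimal self-contained sketch
of the same scanning idea; the sample HTML and the filenames in it are
made up for illustration:

```python
def extract_filenames(s):
    """Collect every name between '<a href="somefile' and the
    '">Save File</a></B>' closing text, scanning left to right."""
    marker = '<a href="somefile'
    names = []
    stop = 0
    while True:
        hit = s.find(marker, stop)
        if hit < 0:          # no more links: stop scanning
            break
        start = hit + len(marker)
        stop = s.find('">Save File</a></B>', start)
        if stop < 0:         # malformed tail: stop scanning
            break
        names.append('somefile' + s[start:stop])
    return names

page = ('<B><a href="somefile001.jpg">Save File</a></B>'
        '<B><a href="somefile002.jpg">Save File</a></B>')
print(extract_filenames(page))  # ['somefile001.jpg', 'somefile002.jpg']
```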
Now, to give you some advice: don't do this by hand, use an HTML
parsing library. The code above is incredibly fragile, and will break
on any number of minor variations in the input text. Using a real
parser not only avoids all those problems, it makes your code shorter.
I like BeautifulSoup:
    soup = BeautifulSoup(s)
    for anchor in soup.fetch('a'):
        fileName = anchor['href']
to get all the hrefs. If you only want the ones that have "Save File"
in the link text, you'd do:
    soup = BeautifulSoup(s)
    for link in soup.fetchText('Save File'):
        fileName = link.findParent('a')['href']
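If BeautifulSoup isn't available, the same "use a real parser" advice can
be followed with nothing but the standard library. This is a sketch of
that idea using html.parser from modern Python (not from the original
post); the 'Save File' link text and the sample hrefs are assumptions:

```python
from html.parser import HTMLParser

class SaveFileLinks(HTMLParser):
    """Collect the href of every <a> whose link text is 'Save File'."""

    def __init__(self):
        super().__init__()
        self.hrefs = []
        self._current = None  # href of the <a> we are inside, if any

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            self._current = dict(attrs).get('href')

    def handle_data(self, data):
        # Only keep the link if its text mentions 'Save File'.
        if self._current and 'Save File' in data:
            self.hrefs.append(self._current)
            self._current = None

    def handle_endtag(self, tag):
        if tag == 'a':
            self._current = None

parser = SaveFileLinks()
parser.feed('<B><a href="somefile1.jpg">Save File</a></B>'
            '<a href="other.html">Home</a>'
            '<B><a href="somefile2.jpg">Save File</a></B>')
print(parser.hrefs)  # ['somefile1.jpg', 'somefile2.jpg']
```

Like the BeautifulSoup version, this survives attribute reordering,
extra whitespace, and other minor variations that break the string-find
approach.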
<mike
--
Mike Meyer <mwm at mired.org> http://www.mired.org/home/mwm/
Independent WWW/Perforce/FreeBSD/Unix consultant, email for more information.