how to count and extract images

Mike Meyer mwm at mired.org
Sun Oct 23 20:55:21 EDT 2005


Joe <dinamo99 at lycos.com> writes:
> start = s.find('<a href="somefile') + len('<a href="somefile')
> stop = s.find('">Save File</a></B>', start)
> fileName = s[start:stop]
> and then construct the URL with the filename to download the image,
> which works fine because every image has the Save File link, so I can
> count the number of images easily. The problem is when there is more
> than one image: I try using a while loop to download the files, and it
> works fine for the first one but always matches the same link. How can
> I count them and tell the loop to skip the first one once it has been
> downloaded, move on to the next one, and so on?

To answer your question, use find's first optional argument (the start
position) in both invocations, so each search begins after the previous
match:

stop = 0
while stop >= 0:
    # find()'s optional second argument sets where the search starts,
    # so each pass picks up after the previous match.
    start = s.find('<a href="somefile', stop) + len('<a href="somefile')
    stop = s.find('">Save File</a></B>', start)
    fileName = s[start:stop]
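
If you also want the count your subject line asks about, append each name
to a list instead of overwriting fileName each time through the loop.
Here's a rough sketch of the whole thing, download included; the base URL
and the urllib.urlretrieve call are guesses on my part, since your post
doesn't show how you build the final URL:

import urllib

fileNames = []
stop = 0
while True:
    start = s.find('<a href="somefile', stop)
    if start < 0:
        break                      # no more Save File links
    start += len('<a href="somefile')
    stop = s.find('">Save File</a></B>', start)
    fileNames.append(s[start:stop])

print len(fileNames), 'files found'

# Hypothetical base URL; adjust to however the real page's links resolve.
for name in fileNames:
    urllib.urlretrieve('http://example.com/somefile' + name,
                       'somefile' + name)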

Now, to give you some advice: don't do this by hand; use an HTML
parsing library. The code above is incredibly fragile and will break
on any number of minor variations in the input text. Using a real
parser not only avoids all those problems, it also makes your code shorter.
I like BeautifulSoup:

from BeautifulSoup import BeautifulSoup

soup = BeautifulSoup(s)
for anchor in soup.fetch('a'):
    fileName = anchor['href']

to get all the hrefs. If you only want the ones that have "Save File"
in the link text, you'd do:

soup = BeautifulSoup(s)
for link in soup.fetchText('Save File'):
    fileName = link.findParent('a')['href']
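
From there the download is only a couple more lines. Here's a rough
sketch; the page URL is a placeholder, and I'm assuming the hrefs are
relative to the page you fetched (urljoin and urlretrieve are both in
the standard library):

import urllib
from urlparse import urljoin
from BeautifulSoup import BeautifulSoup

# Placeholder: whatever URL the HTML in s was fetched from.
pageUrl = 'http://example.com/gallery.html'

soup = BeautifulSoup(s)
links = soup.fetchText('Save File')
print len(links), 'files found'

for link in links:
    href = link.findParent('a')['href']
    # Resolve a relative href against the page URL and save the file
    # under its last path component.
    urllib.urlretrieve(urljoin(pageUrl, href), href.split('/')[-1])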

    <mike
-- 
Mike Meyer <mwm at mired.org>			http://www.mired.org/home/mwm/
Independent WWW/Perforce/FreeBSD/Unix consultant, email for more information.


