Too big of a list? and other problems

John Machin sjmachin at lexicon.net
Mon May 22 22:10:31 EDT 2006


On 23/05/2006 10:19 AM, Brian wrote:
> First off, I am sorry for cluttering this group with my inept
> questions, but I am stuck again despite a few hours of hair pulling.
> 
> I have a function (below) that takes a list of html pages that have
> images on them (not porn but boats).  This function then (supposedly)
> goes through and extracts the links to those images and puts them into
> a list, appending with each iteration of the for loop.  The list of
> html pages is 82 items long and each page has multiple image links.
> When the function gets to item 77 or so, the list gets all funky.
> Sometimes it goes empty,

The list (not a tuple!!) found by findall is empty or smaller than 
expected when the webmaster has used .jpg instead of .jpeg. Pages 27, 
77, and 79-82 at the moment have all .jpg as you would have found out 
had you inspected the actual data you are operating on instead of 
guessing. The print statement is your friend; use it. Your browser's 
"view source" functionality (ctrl-U in Firefox) is also handy.
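To see the .jpg/.jpeg effect without hitting the site, here is a small sketch. The sample markup is made up for illustration and is not copied from the actual pages; only the two patterns correspond to the regex discussed in this thread.

```python
import re

# Hypothetical sample markup; the real pages will differ.
html = ('<img src="images/boat1.jpeg">\n'
        '<img src="images/boat2.jpg">\n')

strict = re.compile(r"images/.*\.jpeg")    # matches only names ending in .jpeg
relaxed = re.compile(r"images/.*\.jpe?g")  # the "e" is optional: .jpeg or .jpg

strict_hits = strict.findall(html)
relaxed_hits = relaxed.findall(html)
print(strict_hits)
print(relaxed_hits)
```

With this sample the strict pattern finds one link and the relaxed pattern finds both, which is the kind of shortfall you would see on a page that uses only .jpg.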

However if you mean that your foundPics list becomes empty, then either 
you haven't posted the code that you actually used, or the pixies from 
the bottom of the garden have been rearranging it for you :-)

> and others it is a much more abbreviated list
> than I expect - it should have roughly 750 image links.
> 
> When I looked at it while running, it appears as if my regex is
> actually appending a tuple (I think) of the results it finds to the
> list.

No, read the manual. findall returns a list. *You* are appending that 
list to your list.
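A minimal sketch of what is going on, assuming a made-up one-page sample rather than the real data:

```python
import re

pics = re.compile(r"images/.*\.jpeg")
# Hypothetical one-page sample, not real data.
html = '<img src="images/a.jpeg">\n<img src="images/b.jpeg">\n'

result = pics.findall(html)
print(type(result))        # findall hands back a list, not a tuple

foundPics = []
foundPics.append(result)   # adds ONE item: the whole list for this page
print(len(foundPics))      # counts pages appended, not pictures
print(foundPics)
```

So len(foundPics) after the loop tells you how many pages were processed, not how many pictures were found.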

> My best guess is that the list is getting too big and croaks.

Very unlikely. In any case you would have seen evidence, like an 
exception and a traceback ... or maybe just your swap disk going into 
overdrive :-)

> Since one of the objects of the function is also to be able to count
> the items in the list, I am getting some strange errors there as well.

And what were the strange errors that you perceived?

> 
> Here is the code:
[snip]
Here is mine:

import re, urllib
def countPics():
     foundPics = []
     links_count = 0
     pics_count = 0
     pics = re.compile(r"images/.*\.jpeg")
     # for better results, change jpeg to jpe?g
     for link in ["cetaceaPage%02d.html" % x for x in range(1, 83)]:
         picPage = urllib.urlopen(
             "http://continuouswave.com/whaler/cetacea/" + link)
         links_count += 1
         html = picPage.read()
         picPage.close()
         findall_result = pics.findall(html)
         pics_count += len(findall_result)
         print links_count, pics_count, link, findall_result
         foundPics.append(findall_result)
     print("done")

countPics()

You may wish to change that append to extend, but then you will lose 
track of which pictures are on which page, if that matters to you.
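The trade-off can be sketched like this. The page_results dict and its contents are hypothetical stand-ins for what findall would return per page; nothing here comes from the real site.

```python
# Hypothetical per-page results, standing in for what findall returns.
page_results = {
    "cetaceaPage01.html": ["images/a.jpeg", "images/b.jpeg"],
    "cetaceaPage02.html": ["images/c.jpeg"],
}

nested = []   # append: one sub-list per page, page grouping is kept
flat = []     # extend: one entry per picture, page grouping is lost
for page in sorted(page_results):
    nested.append(page_results[page])
    flat.extend(page_results[page])

print(len(nested))   # number of pages
print(len(flat))     # number of pictures

# With the nested form, the picture total needs a sum over the sub-lists.
total = sum(len(found) for found in nested)
print(total)
```

If you need both the count and the page-to-picture mapping, keeping the nested form (or a dict keyed by page name, as in the stand-in data above) and summing the lengths is one way to get there.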

HTH,
John


