Too big of a list? and other problems

Tim Chase python.list at tim.thechases.com
Mon May 22 21:27:54 EDT 2006


>         pics = re.compile(r"images/.*\.jpeg")

While I'm not sure if this is the issue, you might be having some 
trouble with the greediness of the "*" repeater here.  HTML like

    <img src="images/1.jpeg"><img src="hello.jpeg">

will yield a result of

    "images/1.jpeg"><img src="hello.jpeg"

rather than the expected

    "images/1.jpeg"

You can make it "stingy" (rather than greedy) by appending a 
question-mark:

    r"images/.*?\.jpeg"
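
To see the difference on the sample HTML above, here is a minimal sketch (the variable names are just for illustration, and it uses findall() rather than whatever your script does with the compiled pattern):

```python
import re

html = '<img src="images/1.jpeg"><img src="hello.jpeg">'

# Greedy: ".*" runs to the LAST ".jpeg" on the line
greedy = re.findall(r"images/.*\.jpeg", html)

# Stingy: ".*?" stops at the FIRST ".jpeg" it can reach
stingy = re.findall(r"images/.*?\.jpeg", html)

print(greedy)  # ['images/1.jpeg"><img src="hello.jpeg']
print(stingy)  # ['images/1.jpeg']
```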

I also don't know if they all are coming back as "jpeg", or if 
some come back as "jpg", in which case you might want to use

    r"images/.*?\.jpe?g"

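This can still bork up on things like

    <img src="images/a.gif"><img src="2.jpeg">

because the match starts at "images/" and the stingy ".*?" keeps 
expanding until it finds a ".jpeg", even if that lands in a 
later tag.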
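
A quick sketch of that failure case. The second pattern is not from the original code: it is one possible workaround, assuming the URLs always sit inside double-quoted attributes, that swaps "." for a character class that cannot cross a closing quote.

```python
import re

html = '<img src="images/a.gif"><img src="2.jpeg">'

# Stingy pattern from above: still spills into the next tag
lazy = re.compile(r"images/.*?\.jpe?g")

# Assumed workaround: [^"]* cannot run past the closing quote
bounded = re.compile(r'images/[^"]*\.jpe?g')

lazy_hits = lazy.findall(html)
bounded_hits = bounded.findall(html)
print(lazy_hits)     # ['images/a.gif"><img src="2.jpeg']
print(bounded_hits)  # []

# The bounded pattern still picks up both spellings of the extension
ok = '<img src="images/1.jpg"><img src="images/2.jpeg">'
ok_hits = bounded.findall(ok)
print(ok_hits)       # ['images/1.jpg', 'images/2.jpeg']
```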

My first thought would be to install the BeautifulSoup parser, 
and then use it to snag all the <img> tags in your document. 
Then you know you're just getting the tag, and in turn, just 
getting their associated "src" attribute.  I do something like 
that in my comic-snatcher (scrapes comics from various sites so I 
can read them all in one place in one sitting).  You're welcome 
to remash this code excerpt (there's no guarantee it's great code):

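    import re
    import urllib2
    import BeautifulSoup

    # url, referer and targetRegex are set elsewhere in the script
    req = urllib2.Request(url)
    req.add_header("Referer", referer)
    page = urllib2.urlopen(req)
    # feed the page to the parser line by line
    bs = BeautifulSoup.BeautifulSoup()
    map(bs.feed, page.readlines())
    bs.done()
    r = re.compile(targetRegex)
    # grab the src of every <img> tag, then keep the ones the regex
    # matches (match() anchors at the start of each URL)
    imageURLs = [img["src"] for img in bs.fetch("img")]
    targetImageURL = [url for url in imageURLs if r.match(url)]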

It does blithely assume every image has a "src" attribute as it 
should, but if not, you can put in an "if" clause in the 
assignment of imageURLs to only take those that have src attributes.
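
If you'd rather try the idea without installing anything, here is a rough stand-in using only the standard library's html.parser (Python 3). It is not the BeautifulSoup code above: the class and function names are made up for illustration, but it shows the same two steps, collecting <img> tags and skipping the ones with no "src".

```python
from html.parser import HTMLParser


class ImgSrcCollector(HTMLParser):
    """Collect the src attribute of every <img> tag that has one."""

    def __init__(self):
        super().__init__()
        self.srcs = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        # attrs is a list of (name, value) pairs
        src = dict(attrs).get("src")
        if src:  # the "if" clause: skip images with no src
            self.srcs.append(src)


def image_sources(html):
    parser = ImgSrcCollector()
    parser.feed(html)
    parser.close()
    return parser.srcs


sample = '<img src="images/1.jpeg"><img alt="no source"><img src="hello.jpeg">'
print(image_sources(sample))  # ['images/1.jpeg', 'hello.jpeg']
```

From there the regex only has to test each URL on its own, so greediness across tags stops being a concern.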

As others have mentioned as well, once you successfully get back 
the list of images, you'll likely want to *extend()* your master 
list of image URLs with your list of currently-found-URLs, rather 
than *append()*, or otherwise you'll end up with a list of lists 
which may not be what you want.
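
A tiny illustration of the difference, with made-up list names and URLs:

```python
found_urls = ["images/1.jpeg", "images/2.jpeg"]

# append() adds the whole list as ONE element
appended = ["images/0.jpeg"]
appended.append(found_urls)

# extend() adds each URL individually
extended = ["images/0.jpeg"]
extended.extend(found_urls)

print(appended)  # ['images/0.jpeg', ['images/1.jpeg', 'images/2.jpeg']]
print(extended)  # ['images/0.jpeg', 'images/1.jpeg', 'images/2.jpeg']
```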

Just a few ideas you might want to try.

-tkc







