Oh what a twisted thread we weave....

Fri Oct 28 20:29:46 EDT 2005

Hi

First off I'm not using anything from Twisted. I just liked the subject
line :)

The folks of this list have been most helpful before and I'm hoping
that you'll take pity on a the dazed and confused. I've read stuff on
this group and various website and book until my head is spinning...

Here is a brief summary of what I'm trying to do and an example below.
I have the code below in a single threaded version and use it to test a
list of roughly 6000 urls ensure that they "work". If they fail I track
the kind of failures and then generate a report. Currently it take
about 7 - 9 hours to run through the entire list. I basically create a
list from a file containing a  list of URLS and then iterate over the
list and check each page as I go through the list. I get all sort of
flack because it takes so long so I thought I could speed it up by
using a Queue and X number of threads. Seems easier said then done.

However in my test below I can't even get it to catch a single error in
my if statement in the Run() function. I'm stumped as to why. Any help
would be Greatly appreciated. and if so inclined pointers on how to
limit the number of threads of a give number of threads.

Thank you in advance! I really do appreciate it

here is what I have so far... Yes there are somethings that are unused
from previous test. Oh and to give proper credit this is based on some
code from  http://starship.python.net/crew/aahz/OSCON2000/SCRIPT2.HTM

import threading, Queue
from time import sleep, time
import urllib2
import formatter
import string
#toscan = Queue.Queue
#scanned = Queue.Queue
#workQueue = Queue.Queue()

MAX_THREADS = 10

timeout = 90			# sets timeout for urllib2.urlopen()
failedlinks = []		# list for failed urls
zeromatch = []			# list for 0 result searches
t = 0 					# used to store starting time for getting a page.
pagetime = 0			# time it took to load page
slowestpage = 0			# slowest page time
fastestpage = 10 		# fastest page time
cumulative = 0			# total time to load all pages (used to calc. avg)
ST_zeroMatch = 'You found 0 products'
ST_zeroMatch2 = 'There are no products matching your selection'

class Retriever(threading.Thread):
	def __init__(self, URL):
		self.done = 0
		self.URL = URL
		self.urlObj = ''
		self.ST_zeroMatch = ST_zeroMatch
		print '__init__:self.URL', self.URL
		threading.Thread.__init__(self)

	def run(self):
		print 'In run()'
		print "Retrieving:", self.URL
		#self.page = urllib.urlopen(self.URL)
		#self.body = self.page.read()
		#self.page.close()
		self.t = time()
		self.urlObj = urllib2.urlopen(self.URL)
		self.pagetime = time() - t
		self.webpg = self.urlObj.read()
		print 'Retriever.run: before if'
		print 'matching', self.ST_zeroMatch
		print ST_zeroMatch
# why does this always drop through even though the If should be true.
		if (ST_zeroMatch or ST_zeroMatch2) in self.webpg:
			# I don't think I want to use self.zeromatch, do I?
			print '** Found zeromatch'
			zeromatch.append(url)
		#self.parse()
		print 'Retriever.run: past if'
		print 'exiting run()'
		self.done = 1

# the last 2 Shop.Com Urls should trigger the zeromatch condition
sites = ['http://www.foo.com/',
    'http://www.shop.com',
    'http://www.shop.com/op/aprod-~zzsome+thing',
    'http://www.shop.com/op/aprod-~xyzzy'
    #'http://www.yahoo.com/ThisPageDoesntExist'
    ]

threadList = []
URLs = []
workQueue = Queue.Queue()

for item in sites:
    workQueue.put(item)

print workQueue
print
print 'b4 test in sites'

for test in sites:
    retriever = Retriever(test)
    retriever.start()
    threadList.append(retriever)

print 'threadList:'
print threadList
print 'past for test in sites:'

while threading.activeCount()>1:
	print'Zzz...'
	sleep(1)

print 'entering retriever for loop'
for retriever in threadList:
    #URLs.extend(retriever.run())
	retriever.run()

print 'zeromatch:', zeromatch
# even though there are two URLs that that should be here nothing ever
gets appeneded to the list.