Oh what a twisted thread we weave....

GregM gregm at taming-tech.com
Mon Oct 31 14:25:43 EST 2005


Tom,

Thanks for the reply and sorry for the delay in getting back to you.
Thanks for pointing out my logic problem. I had added the 2nd part of
the if statement at the last minute...

Yes, I have a single-threaded version. It's several hundred lines and uses
COM to write the results out to an Excel spreadsheet. I was trying to
better understand threading and queues before I started hacking on my
current code... maybe that was a mistake... hey, I'm still learning, and
I learn a lot just by reading stuff posted to this group. I hope at
some point I can help others in the same way.

Here are the relevant parts of the code (no COM stuff).

Here is a summary:
# see if url exists
# if exists then
# 	hit page
# 	get text of page
# 	see if text of page contains search terms
#	if it does then
#		update appropriate counters and lists
#	else update static line and do the next one
# when done with Links list
#	- calculate totals and times
#	- write info to xls file
# end.

# utils are functions and classes that I wrote
# from utils import PrintStatic, HttpExists2
#
# My version of 'easyExcel' with extensions and improvements.
# import excelled
import urllib2
import time
import socket
import os
#import msvcrt         # for printstatic
from datetime import datetime
import pythoncom
from sys import exc_info, stdout, argv, exit

# search terms to use for matching.
#primarySearchTerm = 'Narrow your'
ST_lookingFor = 'Looking for Something'
ST_errorConnecting = 'there has been an error connecting'
ST_zeroMatch = 'You found 0 products'
ST_zeroMatch2 = 'There are no products matching your selection'

#initialize Globals
timeout = 90			# sets timeout for urllib2.urlopen()
failedlinks = []		# list for failed urls
zeromatch = []			# list for 0 result searches
pseudo404 = []			# list for shop.com 404 pages
t = 0 					# used to store starting time for getting a page.
count = 0				# number of tests so far
pagetime = 0			# time it took to load page
slowestpage = 0			# slowest page time
fastestpage = 10 		# fastest page time
cumulative = 0			# total time to load all pages (used to calc. avg)

#version number of the program
version = 'B2.9'

def ShopCom404(testUrl):
	""" checks url for shop.com 404 url
		shop.com 404 url -- returns status 200
		http://www.shop.com/amos/cc/main/404/ccsyn/260
	"""
	if '404' in testUrl:
		return True
	else:
		return False

##### main program #####

try:
	links = open(testfile).readlines()
except:
	exc, err, tb = exc_info()
	print 'There is a problem with the file you specified. Check the file and re-run the program.\n'
	#print str(exc)
	print str(err)
	print
	exit()

# timeout in seconds
socket.setdefaulttimeout(timeout)
totalNumberTests = len(links)
print 'URLCheck ' + version + ' by Greg Moore (c) 2005 Shop.com\n\n'
# asctime() returns a human readable time stamp whereas time() doesn't
startTimeStr = time.asctime()
start = datetime.today()
for url in links:
	count = count + 1
	#HttpExists2 - checks to see if URL exists and detects redirection.
	# handles 404's and exceptions better. Returns tuple depending on results:
	# if found: true and final url.	if not found: false and attempted url
	pgChk = HttpExists2(url)
	if pgChk[0] == False:
		#failed url Exists
		failedlinks.append(pgChk[1])
	elif ShopCom404(pgChk[1]):
		#Our version of a 404
		pseudo404.append(url)
	if pgChk[0] and not ShopCom404(pgChk[1]):
		#if valid page not a 404 then get the page and check it.
		try:
			t = time.time()
			urlObj = urllib2.urlopen(url)
			pagetime = time.time() - t
			webpg = urlObj.read()
			if (ST_zeroMatch in webpg) or (ST_zeroMatch2 in webpg):
				zeromatch.append(url)
			elif ST_errorConnecting in webpg:
			# for some reason we got the error page
			# so add it to the failed urls
				failmsg = 'Error Connecting Page with: ' + url
				failedlinks.append(failmsg)
		except:
			print 'exception with: ' + url
	#figure page times
	cumulative += pagetime
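	# keep the times as plain numbers so the comparisons stay numeric;
	# the matching urls are stored separately.
	if pagetime > slowestpage:
		slowestpage = pagetime
		slowesturl = url.strip()
	if pagetime < fastestpage:
		fastestpage = pagetime
		fastesturl = url.strip()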
	msg = 'testing ' + str(count) + ' of ' + str(totalNumberTests) + \
		'. Current runtime: ' + str(datetime.today() - start)
	# status message that updates the same line.
	#PrintStatic(msg)

### Now write out results
end = datetime.today()
finished = datetime.today()
finishedTimeStr = time.asctime()
avg = cumulative/totalNumberTests
failed = len(failedlinks)
nomatches = len(zeromatch)

#setup COM connection to Excel and write the spreadsheet.

If I understand what I've read about threading, I need to convert much
of the above into a function and then call threading.Thread's start() or
run() to fire off each thread. But where and how to do that, and how to
limit it to X number of threads, is the part I get lost on. The examples
I've seen using queues and threads never show using a list (sequence) for
the source data, and I'm not sure where I'd use the Queue stuff or, for
that matter, if I'm just complicating the issue.
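
For reference, here is a minimal sketch of the pattern I think the Queue
examples are driving at: the list of links is pushed onto a job queue, a
fixed number of worker threads pull from it, and results come back on a
second queue. This is only my guess at the shape of it, not working code
from my program. NUM_WORKERS, check_url, worker and run_checks are all
made-up names, and check_url is an offline stand-in for the real per-url
work (HttpExists2, urlopen and the search-term checks). It is written so
it should run on current Python as well, hence the guarded import.

```python
import threading
try:
    import queue            # Python 3 name
except ImportError:
    import Queue as queue   # Python 2 name

NUM_WORKERS = 4   # hypothetical cap: at most this many threads run at once


def check_url(url):
    """Hypothetical stand-in for the per-url work.

    The real version would call HttpExists2, fetch the page and look for
    the search terms; this one only looks at the url so the sketch runs
    without a network connection.
    """
    if '404' in url:
        return 'pseudo404'
    return 'ok'


def worker(jobs, results):
    """Pull urls off the job queue until a None sentinel arrives."""
    while True:
        url = jobs.get()
        if url is None:
            break
        try:
            results.put((url, check_url(url)))
        except Exception as err:
            # record the failure instead of letting the thread die silently
            results.put((url, 'exception: %s' % err))


def run_checks(links, num_workers=NUM_WORKERS):
    """Check every url in links using num_workers threads."""
    jobs = queue.Queue()
    results = queue.Queue()
    threads = [threading.Thread(target=worker, args=(jobs, results))
               for _ in range(num_workers)]
    for t in threads:
        t.start()
    # the plain list is the source data: each line becomes one job
    for line in links:
        jobs.put(line.strip())
    # one sentinel per worker tells it there is no more work
    for _ in threads:
        jobs.put(None)
    for t in threads:
        t.join()
    # all workers have finished, so draining the result queue is safe
    collected = []
    while not results.empty():
        collected.append(results.get())
    return collected
```

If this is right, the counters and lists (failedlinks, zeromatch and so
on) would be updated in the main thread from the collected results, so
the workers never touch shared globals. Results come back in whatever
order the threads finish, not in the order of the links file.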

Once again thanks for the help.
Greg.



