Improving the web page download code.

mukesh tiwari mukeshtiwari.iiitm at gmail.com
Tue Aug 27 16:53:30 EDT 2013


On Wednesday, 28 August 2013 01:49:59 UTC+5:30, MRAB wrote:
> On 27/08/2013 20:41, mukesh tiwari wrote:
> > Hello All,
> > I am doing web stuff for the first time in Python, so I am looking for suggestions. I wrote this code to download the titles of web pages using as few resources (server time, data downloaded) as possible, and it should be reasonably fast. Initially I used BeautifulSoup for parsing, but the person who is going to use this code asked me not to use it and to use regular expressions instead (the reason being that BeautifulSoup is not fast enough?). Also, I was initially downloading the whole page, but in the end I restricted the read to 30000 characters, which is enough to get the title of almost all pages. Right now I can see only two shortcomings of this code. First, when I kill it with SIGINT (Ctrl-C) it dies instantly; I can modify the code to process all the elements in the queue and then let it die. Second, there is one IO call per iteration in the downloadurl function (maybe I could use an async IO call, but I am not sure). I don't have much web programming experience, so I am looking for suggestions to make it more robust. top-1m.csv is the file downloaded from Alexa [1]. Suggestions for writing more idiomatic Python code are also welcome.
> >
> > -Mukesh Tiwari
> >
> > [1] http://www.alexa.com/topsites
> >
> > import urllib2, os, socket, Queue, thread, signal, sys, re
> >
> > class Downloader():
> >
> >     def __init__( self ):
> >         self.q = Queue.Queue( 200 )
> >         self.count = 0
> >
> >     def downloadurl( self ) :
> >         #open a file in append mode and write the result ( improvement: think of writing in chunks )
> >         with open('titleoutput.dat', 'a+' ) as file :
> >             while True :
> >                 try :
> >                     url = self.q.get( )
> >                     data = urllib2.urlopen ( url , data = None , timeout = 10 ).read( 30000 )
> >                     regex = re.compile('<title.*>(.*?)</title>' , re.IGNORECASE)
> >                     #Read the data line by line and get out of the loop as soon as you find the title.
> >                     #title = None
> >                     #for r in data:
> >                     #    if not r :
> >                     #        raise StopIteration
> >                     #    else:
> >                     #        title = regex.search( r )
> >                     #        if title is not None: break
> >
> >                     title = regex.search( data )
> >                     result = ', '.join ( [ url , title.group(1) ] )
> >                     #data.close()
> >                     file.write(''.join( [ result , '\n' ] ) )
> >                 except urllib2.HTTPError as e:
> >                     print ''.join ( [ url, '  ', str ( e ) ] )
> >                 except urllib2.URLError as e:
> >                     print ''.join ( [ url, '  ', str ( e ) ] )
> >                 except Exception as e :
> >                     print ''.join ( [ url, '  ', str( e ) ] )
> >             #The with block calls file.close() automatically.
> >
> >     def createurl ( self ) :
> >         #check if the file exists. If not, create one with a default of 0 (number of 1024-byte chunks already read).
> >         if os.path.exists('bytesread.dat'):
> >             f = open ( 'bytesread.dat', 'r' )
> >             self.count = int ( f.readline() )
> >             f.close()
> >         else:
> >             f = open ( 'bytesread.dat', 'w' )
> >             f.write('0\n')
> >             f.close()
> >
> >         #Reading the data in chunks is fast, but we can miss some sites at chunk boundaries ( it's worth missing them because reading is very fast )
> >         with open('top-1m.csv', 'r') as file:
> >             prefix = ''
> >             file.seek( self.count * 1024 )
> >             #you will land in the middle of a line, so discard up to the next newline
> >             if ( self.count ): file.readline()
> >             for lines in iter ( lambda : file.read( 1024 ) , '' ):
> >                 l = lines.split('\n')
> >                 n = len ( l )
> >                 l[0] = ''.join( [ prefix , l[0] ] )
> >                 for i in xrange ( n - 1 ) : self.q.put ( ''.join ( [ 'http://www.', l[i].split(',')[1] ] ) )
> >                 prefix = l[n-1]
> >                 self.count += 1
> >
> >     #do a graceful exit from here.
> >     def handleexception ( self , signal , frame ) :
> >         with open('bytesread.dat', 'w') as file:
> >             print ''.join ( [ 'Number of 1024-byte chunks read ( probably unfinished ) ' , str ( self.count ) ] )
> >             file.write ( ''.join ( [ str ( self.count ) , '\n' ] ) )
> >             sys.exit(0)
> >
> > if __name__ == '__main__':
> >     u = Downloader()
> >     signal.signal( signal.SIGINT , u.handleexception )
> >     thread.start_new_thread ( u.createurl , () )
> >     for i in xrange ( 5 ) :
> >         thread.start_new_thread ( u.downloadurl , () )
> >     while True : pass
> >
>
> My preferred method when working with background threads is to put a
> sentinel such as None at the end and then when a worker gets an item
> from the queue and sees that it's the sentinel, it puts it back in the
> queue for the other workers to see, and then returns (terminates). The
> main thread can then call each worker thread's .join method to wait for
> it to finish. You currently have the main thread running in a 'busy
> loop', consuming processing time doing nothing!

Hi MRAB,
Thank you for the reply. I wrote that while loop only because there is no join in the thread module [1], but I take your point: I am simply running a while loop that does nothing. If I can somehow block the main thread without burning CPU time, that would be great. Would something like the sketch below work?
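
(Untested sketch of the sentinel approach you describe, using threading.Thread, which does have a join method, instead of thread. The worker body and the URL list here are just placeholders for my actual downloadurl/createurl code.)

import threading, Queue

def worker( q ):
    while True:
        url = q.get()
        if url is None:
            # sentinel: put it back so the other workers also see it, then stop
            q.put( None )
            return
        print 'processing', url    # placeholder for the real download/title-extraction work

if __name__ == '__main__':
    q = Queue.Queue( 200 )
    workers = [ threading.Thread( target = worker , args = ( q, ) ) for _ in xrange( 5 ) ]
    for w in workers:
        w.start()
    for url in [ 'http://www.example.com' , 'http://www.python.org' ]:
        q.put( url )
    q.put( None )    # one sentinel is enough; each worker puts it back for the next one
    for w in workers:
        w.join()     # the main thread blocks here instead of busy-waiting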

-Mukesh Tiwari

[1] http://docs.python.org/2/library/thread.html#module-thread


