Improving the web page download code.
mukesh tiwari
mukeshtiwari.iiitm at gmail.com
Tue Aug 27 16:53:30 EDT 2013
On Wednesday, 28 August 2013 01:49:59 UTC+5:30, MRAB wrote:
> On 27/08/2013 20:41, mukesh tiwari wrote:
> >
> > Hello All,
> >
> > I am doing web stuff for the first time in Python, so I am looking
> > for suggestions. I wrote this code to download the titles of web
> > pages using as few resources (server time, data downloaded) as
> > possible, while still being reasonably fast. Initially I used
> > BeautifulSoup for parsing, but the person who is going to use this
> > code asked me not to use it and to use regular expressions instead
> > (the reason being that BeautifulSoup is not fast enough?). Also, I
> > was initially downloading the whole page, but in the end I
> > restricted it to the first 30000 characters, which is enough to get
> > the title of almost all pages. Right now I can see only two
> > shortcomings of this code: first, when I kill it with SIGINT
> > (ctrl-c) it dies instantly, though I could modify it to process all
> > the elements remaining in the queue and then let it die; second,
> > there is one blocking IO call per iteration in the downloadurl
> > function (maybe I could use an async IO call, but I am not sure). I
> > don't have much web programming experience, so I am looking for
> > suggestions to make it more robust. top-1m.csv is the file
> > downloaded from alexa[1]. Suggestions for writing more idiomatic
> > Python code are also welcome.
> >
> > -Mukesh Tiwari
> >
> > [1] http://www.alexa.com/topsites
> >
> > import urllib2, os, socket, Queue, thread, signal, sys, re
> >
> > class Downloader():
> >
> >     def __init__( self ):
> >         self.q = Queue.Queue( 200 )
> >         self.count = 0
> >
> >     def downloadurl( self ):
> >         # Open the output file in append mode and write the results
> >         # (possible improvement: write in chunks).
> >         with open('titleoutput.dat', 'a+') as file:
> >             while True:
> >                 try:
> >                     url = self.q.get()
> >                     data = urllib2.urlopen( url, data = None, timeout = 10 ).read( 30000 )
> >                     regex = re.compile('<title.*>(.*?)</title>', re.IGNORECASE)
> >                     # Alternative: read the data line by line and leave
> >                     # the loop as soon as the title is found.
> >                     #title = None
> >                     #for r in data:
> >                     #    if not r:
> >                     #        raise StopIteration
> >                     #    else:
> >                     #        title = regex.search( r )
> >                     #        if title is not None: break
> >                     title = regex.search( data )
> >                     result = ', '.join( [ url, title.group(1) ] )
> >                     file.write(''.join( [ result, '\n' ] ))
> >                 except urllib2.HTTPError as e:
> >                     print ''.join( [ url, ' ', str( e ) ] )
> >                 except urllib2.URLError as e:
> >                     print ''.join( [ url, ' ', str( e ) ] )
> >                 except Exception as e:
> >                     print ''.join( [ url, ' ', str( e ) ] )
> >         # The with block calls file.close() automatically.
> >
> >     def createurl( self ):
> >         # Check whether the progress file exists. If not, create one
> >         # with a default value of 0 chunks read.
> >         if os.path.exists('bytesread.dat'):
> >             f = open('bytesread.dat', 'r')
> >             self.count = int( f.readline() )
> >             f.close()
> >         else:
> >             f = open('bytesread.dat', 'w')
> >             f.write('0\n')
> >             f.close()
> >
> >         # Reading the data in 1024-byte chunks is fast, but we can miss
> >         # some sites because of it (worth it, since reading this way is
> >         # very fast).
> >         with open('top-1m.csv', 'r') as file:
> >             prefix = ''
> >             file.seek( self.count * 1024 )
> >             # We may land in the middle of a line, so discard up to the
> >             # next newline.
> >             if self.count: file.readline()
> >             for lines in iter( lambda: file.read( 1024 ), '' ):
> >                 l = lines.split('\n')
> >                 n = len( l )
> >                 l[0] = ''.join( [ prefix, l[0] ] )
> >                 for i in xrange( n - 1 ):
> >                     self.q.put( ''.join( [ 'http://www.', l[i].split(',')[1] ] ) )
> >                 prefix = l[n - 1]
> >                 self.count += 1
> >
> >     # Do a graceful exit from here.
> >     def handleexception( self, signal, frame ):
> >         with open('bytesread.dat', 'w') as file:
> >             print ''.join( [ 'Number of chunks read ( probably unfinished ) ', str( self.count ) ] )
> >             file.write( ''.join( [ str( self.count ), '\n' ] ) )
> >         sys.exit(0)
> >
> > if __name__ == '__main__':
> >     u = Downloader()
> >     signal.signal( signal.SIGINT, u.handleexception )
> >     thread.start_new_thread( u.createurl, () )
> >     for i in xrange( 5 ):
> >         thread.start_new_thread( u.downloadurl, () )
> >     while True: pass
>
> My preferred method when working with background threads is to put a
> sentinel such as None at the end and then, when a worker gets an item
> from the queue and sees that it's the sentinel, it puts it back in
> the queue for the other workers to see, and then returns (terminates).
> The main thread can then call each worker thread's .join method to
> wait for it to finish. You currently have the main thread running in
> a 'busy loop', consuming processing time doing nothing!
Hi MRAB,
Thank you for the reply. I wrote that while loop only because there is
no join in the thread[1] module, but I take your point: the main
thread is just spinning in a busy loop, doing nothing. If I can block
the main thread without wasting computation, that would be great.
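
Something like the sketch below is what I now have in mind: it uses
the higher-level threading module (whose Thread objects do have a
.join method) together with your sentinel idea. The producer/worker
split and the example URL are just placeholders, not the real
createurl/downloadurl bodies.

import threading, Queue

SENTINEL = None                  # marks the end of the work queue
q = Queue.Queue( 200 )

def producer():
    # Placeholder for createurl: the real code would read top-1m.csv.
    for url in [ 'http://www.example.com' ]:
        q.put( url )
    q.put( SENTINEL )            # signal that no more work is coming

def worker():
    # Placeholder for downloadurl: the real code would fetch the url
    # and extract the title.
    while True:
        url = q.get()
        if url is SENTINEL:
            q.put( SENTINEL )    # put it back so the other workers see it
            return               # this worker terminates

threads = [ threading.Thread( target = producer ) ]
threads += [ threading.Thread( target = worker ) for i in xrange( 5 ) ]
for t in threads: t.start()
for t in threads: t.join()       # block here instead of busy-looping

With the joins in place the process also exits on its own once the
queue drains, so the SIGINT handler only has to cover a genuinely
early abort.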
-Mukesh Tiwari
[1] http://docs.python.org/2/library/thread.html#module-thread