[Tutor] using multiprocessing efficiently to process large data file

Prasad, Ramit ramit.prasad at jpmorgan.com
Fri Aug 31 15:41:38 CEST 2012


Please always respond to the list. And avoid top posting.

> -----Original Message-----
> From: Abhishek Pratap [mailto:abhishek.vit at gmail.com]
> Sent: Thursday, August 30, 2012 5:47 PM
> To: Prasad, Ramit
> Subject: Re: [Tutor] using multiprocessing efficiently to process large data
> file
> 
> Hi Ramit
> 
> Thanks for your quick reply. Unfortunately, given the size of the file
> I can't afford to load it all into memory in one go.
> I could read, let's say, the first 1 million lines, process them in
> parallel, and so on. I am looking for an example that does something similar.
> 
> -Abhi
> 

The same logic should work: iterate over the file directly instead of
reading it all into memory, and process each batch once it reaches the
target size.

with open(filename, 'r') as f:   # filename: path to your data file
    iterdata = iter(f)
    grouped_data = []
    for d in iterdata:
        # Each record here spans 2 lines (make this list 8 elements if
        # your records are 8 lines long); assumes the file ends on a
        # record boundary, otherwise next() raises StopIteration.
        record = [d, next(iterdata)]
        grouped_data.append(record)
        if len(grouped_data) >= 1000000 // 8:  # roughly one million lines
            # process batch here
            grouped_data = []
    if grouped_data:
        # process the final, smaller batch here
        pass
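
To actually run those batches in parallel, one option (a rough sketch, not
from the original thread; process_batch, the file name, and the batch size
are placeholders you would replace with your own) is to hand each batch to
a multiprocessing.Pool as it is read:

import multiprocessing
from itertools import islice

def process_batch(lines):
    # Placeholder for the real per-batch work; here it just counts lines.
    return len(lines)

def read_batches(path, batch_size=1000000):
    # Yield the file in batches of batch_size lines without loading
    # the whole file into memory.
    with open(path, 'r') as f:
        while True:
            batch = list(islice(f, batch_size))
            if not batch:
                return
            yield batch

if __name__ == '__main__':
    pool = multiprocessing.Pool()
    # imap pulls batches from the generator lazily, so only a few
    # batches are in memory at any time.
    for result in pool.imap(process_batch, read_batches('large_data.txt')):
        print(result)
    pool.close()
    pool.join()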



