[Tutor] using multiprocessing efficiently to process large data file

Prasad, Ramit ramit.prasad at jpmorgan.com
Fri Aug 31 00:35:07 CEST 2012


> I have a file with a few million lines. I want to process each block of 8
> lines, and from my estimate my job is not IO bound. In other words, it
> takes a lot more time to do the computation than it would take to
> simply read the file.
> 
> I am wondering how I can go about reading data from this file at a faster
> pace and then farm out the jobs to a worker function using the
> multiprocessing module.
> 
> I can think of two ways.
> 
> 1. split the file and read it in parallel (didn't work well for me),
> primarily because I don't know how to read a file in parallel
> efficiently.
> 2. keep reading the file sequentially into a buffer of some size and
> farm out chunks of the data through multiprocessing.
> 
> Any example would be of great help.
>

The general logic should work, but I did not test it with a real file.

with open(filename, 'r') as f:
    data = f.readlines()

iterdata = iter(data)
grouped_data = []
for d in iterdata:
    # Build a block of 8 lines: the current line plus the next 7.
    # This assumes the total line count is an exact multiple of 8.
    block = [d] + [next(iterdata) for _ in range(7)]
    grouped_data.append(block)

# batch_process on grouped_data
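To farm those blocks out to workers, grouped_data can be handed to a
multiprocessing.Pool; pool.map() distributes the blocks across processes
and collects the results in order. A minimal sketch, assuming grouped_data
from the snippet above and a hypothetical CPU-bound process_block() that
handles one 8-line block:

import multiprocessing

def process_block(block):
    # Stand-in for the real computation on one block of 8 lines.
    return len(block)

if __name__ == '__main__':
    pool = multiprocessing.Pool()   # defaults to one worker per CPU
    results = pool.map(process_block, grouped_data)
    pool.close()
    pool.join()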

Theoretically you might be able to call next() directly on
the file without doing readlines().
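That is correct: a file object is already an iterator over its lines, so
itertools.islice() can pull 8 lines at a time without readlines() and
without holding the whole file in memory. A rough sketch (read_blocks and
the filename parameter are just illustrative names):

import itertools

def read_blocks(filename, size=8):
    # Yield successive blocks of `size` lines, reading lazily from the
    # file object instead of building one big list up front.
    with open(filename, 'r') as f:
        while True:
            block = list(itertools.islice(f, size))
            if not block:
                break
            yield block

A generator like this can be passed to pool.imap() in place of the
pre-built grouped_data list, so lines are read only as fast as the
workers consume them.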



Ramit



