[Tutor] using multiprocessing efficiently to process large data file

Wayne Werner wayne at waynewerner.com
Sat Sep 1 15:14:53 CEST 2012


On Thu, 30 Aug 2012, Abhishek Pratap wrote:

> Hi Guys
>
> I have a file with a few million lines. I want to process each block of 8
> lines, and from my estimate my job is not IO bound. In other words it
> takes a lot more time to do the computation than it would take for
> simply reading the file.
>
> I am wondering how I can go about reading data from this file at a faster
> pace and then farm out the jobs to a worker function using the
> multiprocessing module.
>
> I can think of two ways.
>
> 1. split the file and read it in parallel (didn't work well for me),
> primarily because I don't know how to read a file in parallel
> efficiently.
> 2. keep reading the file sequentially into a buffer of some size and
> farm out chunks of the data through multiprocessing.

As other folks have mentioned, posting at least your general algorithm 
would make it much easier to give specific advice.

But here's another way you could iterate over the file, if you know 
exactly how many lines you have (or at least a number the line count is 
divisible by):

with open('inputfile') as f:
    for line1, line2, line3, line4 in zip(f, f, f, f):
        # each pass through the loop consumes four consecutive lines;
        # do your processing here
        pass

The caveat is that if your line count isn't evenly divisible by 4, you'll 
lose the last (count % 4) lines.
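
If you'd rather keep that short final group instead of dropping it, one 
option (just a sketch - note it pads the last group with None) is 
itertools.zip_longest:

from itertools import zip_longest   # izip_longest on Python 2

with open('inputfile') as f:
    for group in zip_longest(f, f, f, f):
        lines = [line for line in group if line is not None]  # drop the padding
        # do your processing on `lines` here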

The reason the zip() version works is that zip() combines several 
sequences and returns a new iterator. In this case all four arguments are 
the same file handle f, which is itself an iterator, so each pass through 
the for loop calls next() on f four times in a row and hands you four 
consecutive lines. The problem comes at the end of the file - say you've 
only got one line left on the last pass. When zip calls next() on the 
first f, it gets that last line, but the very next call - for the second 
slot, which is still the same file - hits end-of-file and raises 
StopIteration, which ends your loop without actually processing that 
leftover line on the inside!
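
You can see the effect with a short list standing in for the file:

>>> lines = ['a\n', 'b\n', 'c\n', 'd\n', 'e\n']
>>> it = iter(lines)
>>> list(zip(it, it, it, it))
[('a\n', 'b\n', 'c\n', 'd\n')]

The fifth line is consumed by the first next() call of the last round, 
but zip throws it away when the second call raises StopIteration.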

So, this probably isn't the best way to handle your issue, but maybe it 
is!
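
For the multiprocessing side (your option 2), something along these lines 
might be a starting point. It's only a sketch: process_block() stands in 
for your real computation, and the chunksize is a placeholder you'd want 
to tune:

import multiprocessing

def process_block(block):
    # stand-in for the real CPU-bound work on one block of lines
    return len(block)

def read_blocks(f, size=8):
    # read the file sequentially, yielding lists of `size` lines
    block = []
    for line in f:
        block.append(line)
        if len(block) == size:
            yield block
            block = []
    if block:            # don't drop a short final block
        yield block

if __name__ == '__main__':
    with open('inputfile') as f:
        pool = multiprocessing.Pool()   # one worker per core by default
        # the parent keeps reading sequentially while the workers compute;
        # a larger chunksize reduces inter-process overhead
        for result in pool.imap(process_block, read_blocks(f), chunksize=100):
            pass   # collect or write results here
        pool.close()
        pool.join()

The file handle stays in the parent process; only the blocks of lines get 
pickled over to the worker processes.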

HTH,
Wayne

