[Tutor] using multiprocessing efficiently to process large data file
Wayne Werner
wayne at waynewerner.com
Sat Sep 1 15:14:53 CEST 2012
On Thu, 30 Aug 2012, Abhishek Pratap wrote:
> Hi Guys
>
> I have a file with a few million lines. I want to process each block of 8
> lines and from my estimate my job is not IO bound. In other words it
> takes a lot more time to do the computation than it would take for
> simply reading the file.
>
> I am wondering how can I go about reading data from this at a faster
> pace and then farm out the jobs to worker function using
> multiprocessing module.
>
> I can think of two ways.
>
> 1. Split the file and read it in parallel (didn't work well for me,
> primarily because I don't know how to read a file in parallel
> efficiently).
> 2. Keep reading the file sequentially into a buffer of some size and
> farm out chunks of the data through the multiprocessing module.
As other folks have mentioned, posting at least your general algorithm
would make it a lot easier to give specific advice.
But here's another way you could iterate over the file, if you know
exactly how many lines it has (or at least a number the line count is
divisible by):
with open('inputfile') as f:
    for line1, line2, line3, line4 in zip(f, f, f, f):
        # do your processing here
The caveat is that if your line count isn't evenly divisible by 4, then
you'll lose the last count % 4 lines.
The reason this works is that zip() combines several iterables and
returns a new iterator. In this case it's combining the file handle f
with itself four times, and file handles are themselves iterators, so
each pass through the for loop calls next() on f four times in
succession. The catch comes at the end of the file: say your last pass
has only one line left. When zip's iterator calls next() on the first
`f`, that returns the last line. But f is now at the end of the file, so
calling next() on it again raises StopIteration, which ends your loop
without actually processing anything on the inside - the leftover line
is silently dropped.
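If losing those trailing lines matters, itertools.zip_longest
(izip_longest in Python 2) pads the final group instead of dropping it.
A small sketch - the five-line StringIO here just stands in for your
real file:

```python
import io
from itertools import zip_longest

# Stand-in for open('inputfile'): five lines, so the last
# group of four is incomplete.
f = io.StringIO("a\nb\nc\nd\ne\n")

groups = []
for lines in zip_longest(f, f, f, f, fillvalue=None):
    # Strip the padding Nones out of an incomplete final group.
    groups.append([line.strip() for line in lines if line is not None])

print(groups)  # [['a', 'b', 'c', 'd'], ['e']]
```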
So, this probably isn't the best way to handle your issue, but maybe it
is!
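And for the multiprocessing half of your question, here's a rough sketch
of your option 2 using multiprocessing.Pool. The do_work function is a
hypothetical stand-in for whatever your real computation is, and the
ten-line StringIO stands in for your actual file:

```python
import io
from itertools import zip_longest
from multiprocessing import Pool

def do_work(lines):
    # Hypothetical stand-in for your real computation on one
    # block of lines; here it just sums the line lengths.
    return sum(len(line) for line in lines)

def blocks(f, size=8):
    # Lazily group an open file into tuples of `size` lines,
    # padding the last (possibly short) block with empty strings.
    return zip_longest(*[f] * size, fillvalue='')

if __name__ == '__main__':
    # Stand-in for open('yourfile'): ten 3-character lines.
    f = io.StringIO("ab\n" * 10)
    with Pool() as pool:
        # imap reads blocks lazily and yields results in order,
        # so memory stays bounded even for a huge file.
        results = list(pool.imap(do_work, blocks(f)))
    print(results)
```

Because pool.imap pulls blocks from the iterator as workers free up, you
never hold more than a handful of blocks in memory at once, which is the
buffering behaviour your option 2 describes.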
HTH,
Wayne