Processing a file using multiple threads

aspineux aspineux at gmail.com
Fri Sep 9 00:44:34 EDT 2011


On Sep 9, 12:49 am, Abhishek Pratap <abhishek.... at gmail.com> wrote:
> Hi Guys
>
> My experience with Python is 2 days, and I am looking for a slick
> way to use multi-threading to process a file. Here is what I would
> like to do, which is conceptually similar to MapReduce.
>
> # test case
>
> 1. My input file is 10 GB.
> 2. I want to open 10 file handles, each handling 1 GB of the file.
> 3. Each file handle is processed by an individual thread using the
> same function (so 10 cores in total are assumed to be available on
> the machine).
> 4. There will be 10 different output files.
> 5. Once the 10 jobs are complete, a reduce kind of function will
> combine the output.
>
> Could you give some ideas ?

You can use the "multiprocessing" module instead of threads to bypass
the GIL limitation.

First, cut your file into 10 roughly "equal" parts. If it is line
based, search for the first line boundary close to each cut. Be sure
to have a "start" and an "end" for each part: start is the offset of
the first character of the first line, and end is one line too far
(== the start of the next block).
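
As a rough sketch (names and details are mine, not tested on huge
files): a helper like split_points() below could find those
boundaries by seeking near each ideal cut and then reading to the
next newline, so every part holds only whole lines.

import os

def split_points(filename, nparts):
    # return (start, end) offset pairs whose boundaries all fall on
    # line breaks
    size = os.path.getsize(filename)
    offsets = [0]
    f = open(filename)
    for i in range(1, nparts):
        f.seek(i * size // nparts)  # jump near the ideal cut...
        f.readline()                # ...then advance past the current line
        offsets.append(f.tell())
    f.close()
    offsets.append(size)
    # pair each start with the start of the next block
    return zip(offsets, offsets[1:])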

Then use this function to handle each part:

def handle(filename, start, end):
    f = open(filename)
    f.seek(start)            # jump to the first line of this part
    for l in f:
        # handle line l here
        print l
        start += len(l)
        if start >= end:     # stop once this part is fully consumed
            break
    f.close()

Do it first in a single process/thread to be sure it works (easier
to debug), then split the work across multiple processes.
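
A minimal sketch of the multi-process step using multiprocessing.Pool,
assuming the handle() function above and the split_points() helper
sketched earlier; 'input.txt' and the worker() wrapper are
placeholders:

from multiprocessing import Pool

def worker(args):
    # unpack because Pool.map passes a single argument; each worker
    # could also write its own output file here, to be combined in
    # the final reduce step
    filename, start, end = args
    handle(filename, start, end)

if __name__ == '__main__':
    filename = 'input.txt'
    parts = list(split_points(filename, 10))
    pool = Pool(processes=10)   # one process per part/core
    pool.map(worker, [(filename, s, e) for s, e in parts])
    pool.close()
    pool.join()

Once all workers are done, the reduce step is just reading the 10
per-part output files back and combining them.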


>
> So, given a file, I would like to read it in N chunks through N
> file handles and process each of them separately.
>
> Best,
> -Abhi



