parallel csv-file processing

Paul Rubin
Fri Nov 9 06:02:34 EST 2007


Michel Albert <exhuma at gmail.com> writes:
> buffer = []
> for line in reader:
>    buffer.append(line)
>    if len(buffer) == 1000:
>       f = job_server.submit(calc_scores, buffer)
>       buffer = []
> 
> f = job_server.submit(calc_scores, buffer)
> buffer = []
> 
> but would this not kill my memory if I start loading bigger slices
> into the "buffer" variable?

Why not pass the disk offsets to the job server (untested):

   n = 1000
   offset = csvfile.tell()   # csvfile is the underlying file object; csv.reader has no tell()
   for i, _ in enumerate(iter(csvfile.readline, '')):
       if (i + 1) % n == 0:
           job_server.submit(calc_scores, (offset, n))
           offset = csvfile.tell()
   job_server.submit(calc_scores, (offset, n))   # trailing chunk (may be empty)

The remote process then seeks to that offset and processes n lines starting
from there.
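
On the worker side, calc_scores would reopen the file itself.  A minimal
sketch of what that could look like is below; the file name, the `path'
keyword default, and the toy "sum the numeric fields" scoring are all
placeholders for whatever the real job does:

   import csv
   import itertools

   def calc_scores(offset, n, path='scores.csv'):
       # Seek straight to this chunk and process at most n rows from there.
       scores = []
       f = open(path, 'rb')          # binary mode, as the Python 2 csv module expects
       try:
           f.seek(offset)
           for row in itertools.islice(csv.reader(f), n):
               # placeholder scoring: sum the numeric fields of the row
               scores.append(sum(float(x) for x in row if x))
       finally:
           f.close()
       return scores

If the job server here is Parallel Python (pp), the csv and itertools
imports would normally be handed to submit() through its modules argument
rather than imported at module level, so the function can be shipped to
the workers intact.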


