concurrent file reading/writing using python

Mon Mar 26 18:56:29 EDT 2012

Hi Guys

I am forwarding this question from the Python tutor list in the hope of
reaching more people experienced in concurrent disk access in Python.

I would like to know whether I can read a big file concurrently on a
multi-core server, process the data, and write the output to a single
file as the data is processed.

For example, if I have a 50 GB file, I would like to read it in parallel
with 10 processes/threads, each working on a 5 GB chunk and performing
the same data-parallel computation on its chunk of the file, collating
the output to a single file.

I would appreciate your feedback. I did find some threads about this on
Stack Overflow, but it was not clear to me what would be a good way to
go about implementing this.

Thanks!
-Abhi

---------- Forwarded message ----------
From: Steven D'Aprano <steve at pearwood.info>
Date: Mon, Mar 26, 2012 at 3:21 PM
Subject: Re: [Tutor] concurrent file reading using python
To: tutor at python.org

Abhishek Pratap wrote:
>
> Hi Guys
>
>
> I want to utilize the power of the cores on my server and read big files
> (> 50 GB) simultaneously by seeking to N locations.

Yes, you have many cores on the server. But how many hard drives is
each file on? If all the files are on one disk, then you will *kill*
performance dead by forcing the drive to seek backwards and forwards:

seek to 12345678
read a block
seek to 9947500
read a block
seek to 5891124
read a block
seek back to 12345678 + 1 block
read another block
seek back to 9947500 + 1 block
read another block
...

The drive will spend most of its time seeking instead of reading.
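The access pattern above can be written down as a toy model. This is
only a sketch: the function names and the 4096-byte block size are
made up for illustration, and it ignores caching, read-ahead and
whatever reordering the OS or the drive may do. It only counts how far
the head would have to travel between consecutive reads.

```python
def interleaved_offsets(starts, block_size, rounds):
    """Byte offsets visited when several readers take turns, one block each."""
    order = []
    for i in range(rounds):
        for start in starts:
            order.append(start + i * block_size)
    return order


def total_seek_distance(offsets, block_size):
    """Sum of jumps between the end of one read and the start of the next."""
    distance = 0
    position = offsets[0]
    for offset in offsets:
        distance += abs(offset - position)
        position = offset + block_size  # head rests just past the block read
    return distance


# Three readers taking turns versus one reader going front to back.
block = 4096  # assumed block size, for illustration only
concurrent = interleaved_offsets([12345678, 9947500, 5891124], block, rounds=2)
sequential = interleaved_offsets([0], block, rounds=6)
print(total_seek_distance(concurrent, block))
print(total_seek_distance(sequential, block))
```

In this model the single sequential reader never has to jump at all,
while the three interleaved readers jump millions of bytes between
almost every pair of reads.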

Even if you have multiple hard drives in a RAID array, performance
will depend strongly on the details of how it is configured (RAID1,
RAID0, software RAID, hardware RAID, etc.) and how smart the
controller is.

Chances are, though, that the controller won't be smart enough.
Particularly if you have hardware RAID, which in my experience tends
to be more expensive and less useful than software RAID (at least for
Linux).

And what are you planning on doing with the files once you have read
them? I don't know how much memory your server has got, but I'd be
very surprised if you can fit the entire > 50 GB file in RAM at once.
So you're going to read the files and merge the output... by writing
them to the disk. Now you have the drive trying to read *and* write
simultaneously.

TL;DR:

Tasks which are limited by disk IO are not made faster by using a
faster CPU, since the bottleneck is disk access, not CPU speed.
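If the per-chunk computation does turn out to be CPU-heavy, one layout
that keeps the disk access sequential is a single reader feeding a
pool of worker processes, with a single writer. What follows is only a
sketch under that assumption, not a tuned solution: read_blocks,
process_block, run and the max_pending limit are all hypothetical
names, and the upper-casing stands in for the real computation.

```python
from collections import deque
from concurrent.futures import ProcessPoolExecutor


def read_blocks(path, lines_per_block=100_000):
    """Yield lists of lines, reading the file front to back exactly once."""
    block = []
    with open(path, "r") as handle:
        for line in handle:
            block.append(line)
            if len(block) >= lines_per_block:
                yield block
                block = []
    if block:
        yield block  # final, possibly shorter, block


def process_block(lines):
    """Placeholder for the real computation: upper-case every line."""
    return [line.upper() for line in lines]


def run(in_path, out_path, workers=4, lines_per_block=100_000, max_pending=8):
    """One sequential reader, a pool of CPU workers, one writer."""
    pending = deque()  # futures in submission order
    with ProcessPoolExecutor(workers) as pool, open(out_path, "w") as out:
        for block in read_blocks(in_path, lines_per_block):
            pending.append(pool.submit(process_block, block))
            # Cap the number of blocks in flight so memory use stays bounded.
            if len(pending) >= max_pending:
                out.writelines(pending.popleft().result())
        # Drain what is left, still in submission order.
        while pending:
            out.writelines(pending.popleft().result())
```

Results are written in the order the blocks were read, so the output
keeps the input's line order. On platforms that start workers by
spawning a fresh interpreter, run() should be called from under an
if __name__ == "__main__": guard. Whether this is any faster than a
plain loop depends on whether the job is really CPU-bound; if the disk
is the bottleneck, the extra processes will mostly sit idle.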

Back in the Ancient Days when tape was the only storage medium, there
were a lot of programs optimised for slow IO. Unfortunately this is
pretty much a lost art -- although disk access is thousands or tens of
thousands of times slower than memory access, it is so much faster
than tape that people don't seem to care much about optimising disk
access.

> What I want to know is the best way to read a file concurrently. I
> have read about file-handle.seek(), os.lseek() but not sure if that's
> the way to go. Any use cases would be of help.

Optimising concurrent disk access is a specialist field. You may be
better off asking for help on the main Python list, comp.lang.python
or python-list at python.org, and hope somebody has some experience with
this. But chances are very high that you will need to search the web
for forums dedicated to concurrent disk access, and translate from
whatever language(s) they are using to Python.
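For what it is worth, the seek() mechanics asked about in the quoted
question are the easy part. Here is a minimal sketch, assuming a
line-oriented file, of splitting it into byte ranges that end on line
breaks, so that each worker could seek() to its own start and read
only its own lines. chunk_ranges and iter_chunk are hypothetical
helper names, and nothing here says that reading the ranges
concurrently will be faster on a given disk.

```python
import os


def chunk_ranges(path, n_chunks):
    """Split a file into up to n_chunks (start, end) byte ranges.

    Every range starts at the beginning of a line and ends just after
    a line break (or at end of file).
    """
    size = os.path.getsize(path)
    ranges = []
    start = 0
    with open(path, "rb") as handle:
        for i in range(1, n_chunks + 1):
            if start >= size:
                break  # earlier chunks already covered the whole file
            target = max(size * i // n_chunks, start)
            handle.seek(target)
            handle.readline()  # move to the end of the line containing target
            end = handle.tell()
            ranges.append((start, end))
            start = end
    return ranges


def iter_chunk(path, start, end):
    """Yield the lines stored in bytes [start, end) of the file."""
    with open(path, "rb") as handle:
        handle.seek(start)
        while handle.tell() < end:
            yield handle.readline()
```

Each (start, end) pair could then be handed to a separate process or
thread. The file is opened in binary mode so that the offsets are
plain byte counts.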

--
Steven
