[scikit-image] Image analysis pipeline improvement suggestions

Thu Jan 5 10:02:27 EST 2017

Hi Simone,

unfortunately I can not contribute anything to your questions but would you
mind to share some of your experience with the hdf5 file format and
parallelization?
I am also analysing a lot of big microscopy images and would love to hear
your opinion or get advice for an image pipeline.

Best regards,
Stefanie

2016-12-28 23:37 GMT+01:00 Nathan Faggian <nathan.faggian at gmail.com>:

> Hi Simone,
>
> I have had a little experience with HDF5 and am interested to see where
> you go with this. I wonder if you could use "feather":
>     https://github.com/wesm/feather
>
> There was a recent post from Wes McKinney about feather, which sparked my
> interest:
>    http://wesmckinney.com/blog/high-perf-arrow-to-pandas/
>
> Do you use HDF5 to store intermediates? if so, I would try storing
> intermediates to a file format like feather and then reducing to a HDF5
> file at the end. The reduction should be IO bound and not dependent on RAM
> so would suit your cluster.
>
> If you need to read a large array then I think HDF5 supports that (for
> single write but multiple reads) without the need for MPI - so this could
> map well to a tool like distributed:
>     http://distributed.readthedocs.io/en/latest/
>
> Not sure this helps, there is an assumption (on my part) that your
> intermediate calculations are not terabytes in size.
>
> Good luck!
>
> Nathan
>
>
> On 29 December 2016 at 05:07, simone codeluppi <simone.codeluppi at gmail.com
> > wrote:
>
>> Hi all!
>>
>> I would like to pick your brain for some suggestion on how to modify my
>> image analysis pipeline.
>>
>> I am analyzing terabytes of image stacks generated using a microscope.
>> The current code I generated rely heavily on scikit-image, numpy and scipy.
>> In order to speed up the analysis the code runs on a HPC computer (
>> https://www.nsc.liu.se/systems/triolith/) with MPI (mpi4py) for
>> parallelization and hdf5 (h5py) for file storage. The development cycle of
>> the code has been pretty painful mainly due to my non familiarity with mpi
>> and problems in compiling parallel hdf5 (with many open/closing bugs).
>> However, the big drawback is that each core has only 2Gb of RAM (no shared
>> ram across nodes) and in order to run some of the processing steps i ended
>> up reserving one node (16 cores) but running only 3 cores in order to have
>> enough ram (image chunking won’t work in this case). As you can imagine
>> this is extremely inefficient and i end up getting low priority in the
>> queue system.
>>
>>
>> Our lab currently bought a new 4 nodes server with shared RAM running
>> hadoop. My goal is to move the parallelization of the processing to dask. I
>> tested it before in another system and works great. The drawback is that,
>> if I understood correctly, parallel hdf5 works only with MPI
>> (driver=’mpio’). Hdf5 gave me quite a bit of headache but works well in
>> keeping a good structure of the data and i can save everything as numpy
>> arrays….very handy.
>>
>>
>> If I will move to hadoop/dask what do you think will be a good solution
>> for data storage? Do you have any additional suggestion that can improve
>> the layout of the pipeline? Any help will be greatly appreciated.
>>
>>
>> Simone
>> --
>> *Bad as he is, the Devil may be abus'd,*
>> *Be falsy charg'd, and causelesly accus'd,*
>> *When men, unwilling to be blam'd alone,*
>> *Shift off these Crimes on Him which are their*
>> *Own*
>>
>>                                                       *Daniel Defoe*
>>
>> simone.codeluppi at gmail.com
>>
>> simone at codeluppi.org
>>
>>
>> _______________________________________________
>> scikit-image mailing list
>> scikit-image at python.org
>> https://mail.python.org/mailman/listinfo/scikit-image
>>
>>
>
> _______________________________________________
> scikit-image mailing list
> scikit-image at python.org
> https://mail.python.org/mailman/listinfo/scikit-image
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.python.org/pipermail/scikit-image/attachments/20170105/b8c8e1eb/attachment.html>