Creating Dict of Dict of Lists with joblib and Multiprocessing

Michael Selik michael.selik at gmail.com
Wed Apr 20 11:17:31 EDT 2016


On Wed, Apr 20, 2016 at 10:50 AM Sims, David (NIH/NCI) [C] <
david.sims2 at nih.gov> wrote:

> Hi,
>
> Cross-posted at
> http://stackoverflow.com/questions/36726024/creating-dict-of-dicts-with-joblib-and-multiprocessing,
> but I thought I'd try here too, as there have been no responses there so far.
>
> I'm a bit new to Python and very new to parallel processing in Python.  I
> have a script that will process a datafile and generate a dict of dicts.
> However, as I need to run this task on hundreds to thousands of these files
> and ultimately collate the data, I thought parallel processing made a lot
> of sense.  The trouble is that I can't figure out how to build the same
> data structure when running in parallel.  Minimal script without all the
> helper functions:
>
> #!/usr/bin/python
> import sys
> import os
> import re
> import subprocess
> import multiprocessing
> from joblib import Parallel, delayed
> from collections import defaultdict
> from pprint import pprint
>
> def proc_vcf(vcf,results):
>     sample_name = vcf.rstrip('.vcf')
>     results.setdefault(sample_name, {})
>
>     # Run Helper functions 'run_cmd()' and 'parse_variant_data()' to
>     # generate a list of entries. Expect a dict of dict of lists.
>     all_vars = run_cmd('vcfExtractor',vcf)
>     results[sample_name]['all_vars'] = parse_variant_data(all_vars,'all')
>
>     # Run Helper functions 'run_cmd()' and 'parse_variant_data()' to
>     # generate a different list of data based on a different set of criteria.
>     mois = run_cmd('moi_report', vcf)
>     results[sample_name]['mois'] = parse_variant_data(mois, 'moi')
>     return results
>
> def main():
>     input_files = sys.argv[1:]
>
>     # collected_data = defaultdict(lambda: defaultdict(dict))
>     collected_data = {}
>
>     # Parallel Processing version
>     # num_cores = multiprocessing.cpu_count()
>     # Parallel(n_jobs=num_cores)(delayed(proc_vcf)(vcf,collected_data)
>     #                            for vcf in input_files)
>
>     # for vcf in input_files:
>         # proc_vcf(vcf, collected_data)
>
>     pprint(dict(collected_data))
>     return
>
> if __name__=="__main__":
>     main()
>
>
> It's hard to provide source data as it's very large, but basically the
> script will generate a dict of dicts of lists that contains two sets of
> data for each input, keyed by sample and data type:
>
> { 'sample1' : {
>     'all_vars' : [
>         'data_val1',
>         'data_val2',
>         'etc'],
>     'mois' : [
>         'data_val_x',
>         'data_val_y',
>         'data_val_z']
>     },
>     'sample2' : {
>        'all_vars' : [
>        .
>        .
>        .
>        ]
>     }
> }
>
> If I run it without trying to multiprocess, it works fine.  I can't figure
> out how to parallelize this and create the same data structure.  I've tried
> using defaultdict to create a defaultdict in main() to pass along, as well
> as a few other variations, but I can't seem to get it right (I get key
> errors, pickle errors, etc.).  Can anyone help me with the proper way to do
> this?  I think I'm not creating / initializing / working with the data
> structure correctly, but maybe my whole approach is ill-conceived?
>


Processes cannot share memory, so each worker gets its own copy of
collected_data at the moment you pass it in, and any changes a subprocess
makes to its copy never get back to the parent. There's an undocumented
ThreadPool that works the same way as the process Pool (
https://docs.python.org/3.5/library/multiprocessing.html#using-a-pool-of-workers
).

ThreadPool will share memory across your worker threads. In the example I
linked to, just replace ``from multiprocessing import Pool`` with ``from
multiprocessing.pool import ThreadPool``.
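
Adapted to your script, that would look roughly like this (an untested
sketch; it assumes proc_vcf keeps its current two-argument signature and
that each call only writes its own sample_name key, so no two threads touch
the same slot of the dict):

    import sys
    import multiprocessing
    from functools import partial
    from multiprocessing.pool import ThreadPool
    from pprint import pprint

    def main():
        input_files = sys.argv[1:]
        collected_data = {}

        # Threads share the parent's memory, so whatever proc_vcf writes
        # into collected_data is visible here once the pool finishes.
        pool = ThreadPool(multiprocessing.cpu_count())
        pool.map(partial(proc_vcf, results=collected_data), input_files)
        pool.close()
        pool.join()

        pprint(collected_data)

If you ever start writing to the same key from more than one thread, put a
threading.Lock around those writes.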

How compute-intensive is your task? If it's mostly disk-read-intensive
rather than compute-intensive, then threads are all you need. Since run_cmd
appears to shell out to an external tool, your threads will spend most of
their time waiting on those subprocesses rather than holding the GIL.
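
If it turns out to be compute-bound and you really do need processes, the
usual pattern is to stop mutating a shared dict and instead have each
worker return its own piece, then merge whatever Parallel hands back in the
parent. A rough, untested sketch, assuming a hypothetical proc_vcf variant
that takes just the filename and returns a (sample_name, data) pair:

    # Hypothetical: proc_vcf(vcf) returns (sample_name, per_sample_dict)
    # rather than writing into a shared 'results' argument.
    num_cores = multiprocessing.cpu_count()
    results = Parallel(n_jobs=num_cores)(
        delayed(proc_vcf)(vcf) for vcf in input_files)
    collected_data = dict(results)

Nothing has to be shared that way; the parent just assembles the pieces
after the workers finish.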


