Creating Dict of Dict of Lists with joblib and Multiprocessing

Sims, David (NIH/NCI) [C] david.sims2 at nih.gov
Wed Apr 20 10:43:05 EDT 2016


Hi,

Cross-posted at http://stackoverflow.com/questions/36726024/creating-dict-of-dicts-with-joblib-and-multiprocessing, but I thought I'd try here too, since there have been no responses there so far.

I'm a bit new to Python and very new to parallel processing in Python.  I have a script that processes a data file and generates a dict of dicts.  However, since I need to run this task on hundreds to thousands of these files and ultimately collate the data, parallel processing seemed to make a lot of sense.  The problem is that I can't figure out how to build up the data structure when the work is parallelized.  Here is a minimal script without the helper functions:

#!/usr/bin/python
import sys
import os
import re
import subprocess
import multiprocessing
from joblib import Parallel, delayed
from collections import defaultdict
from pprint import pprint

def proc_vcf(vcf, results):
    # Strip the '.vcf' extension to get the sample name.  Note that
    # str.rstrip() strips characters, not a suffix, so use os.path.splitext().
    sample_name = os.path.splitext(vcf)[0]
    results.setdefault(sample_name, {})

    # Run helper functions run_cmd() and parse_variant_data() (omitted here) to generate a list of entries.  The end result should be a dict of dicts of lists.
    all_vars = run_cmd('vcfExtractor',vcf)
    results[sample_name]['all_vars'] = parse_variant_data(all_vars,'all')

    # Run the same helper functions to generate a different list of data based on a different set of criteria.
    mois = run_cmd('moi_report', vcf)
    results[sample_name]['mois'] = parse_variant_data(mois, 'moi')
    return results

def main():
    input_files = sys.argv[1:]

    # collected_data = defaultdict(lambda: defaultdict(dict))
    collected_data = {}

    # Parallel processing version:
    # num_cores = multiprocessing.cpu_count()
    # Parallel(n_jobs=num_cores)(delayed(proc_vcf)(vcf, collected_data) for vcf in input_files)

    # Serial version (works as expected):
    # for vcf in input_files:
    #     proc_vcf(vcf, collected_data)

    pprint(dict(collected_data))
    return

if __name__=="__main__":
    main()


Hard to provide source data as it's very large, but basically, the dataset will generate a dict of dicts of lists that contain two sets of data for each input keyed by sample and data type:

{ 'sample1' : {
    'all_vars' : [
        'data_val1',
        'data_val2',
        'etc'],
    'mois' : [
        'data_val_x',
        'data_val_y',
        'data_val_z']
    },
    'sample2' : {
       'all_vars' : [
       .
       .
       .
       ]
    }
}

If I run it serially, without multiprocessing, there's no problem.  I just can't figure out how to parallelize this and still end up with the same data structure.  I've tried creating a defaultdict in main() and passing it along, as well as a few other variations, but I can't seem to get it right (I get key errors, pickle errors, etc.).  Can anyone point me to the proper way to do this?  I think I'm not creating / initializing / sharing the data structure correctly, but maybe my whole approach is ill-conceived?
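
For what it's worth, here is a rough sketch of one restructuring I've been considering but haven't verified: have each worker return its own (sample_name, data) pair and let main() merge the results, since I gather that joblib workers run in separate processes and so can't mutate the parent's dict in place.  run_cmd() and parse_variant_data() are still the omitted helpers from above.  Does something like this look like the right direction?

#!/usr/bin/python
import sys
import os
import multiprocessing
from joblib import Parallel, delayed
from pprint import pprint

def proc_vcf(vcf):
    # Build and return this sample's data rather than mutating a shared dict;
    # run_cmd() and parse_variant_data() are the omitted helpers from above.
    sample_name = os.path.splitext(vcf)[0]
    data = {
        'all_vars' : parse_variant_data(run_cmd('vcfExtractor', vcf), 'all'),
        'mois'     : parse_variant_data(run_cmd('moi_report', vcf), 'moi'),
    }
    return sample_name, data

def main():
    input_files = sys.argv[1:]
    num_cores = multiprocessing.cpu_count()

    # Each parallel call returns a (sample_name, data) tuple; assemble the
    # final dict of dicts of lists in the parent process.
    pairs = Parallel(n_jobs=num_cores)(delayed(proc_vcf)(vcf) for vcf in input_files)
    collected_data = dict(pairs)

    pprint(collected_data)

if __name__ == "__main__":
    main()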

