General Purpose Pipeline library?

Mon Nov 20 11:23:41 EST 2017

On Nov 20, 2017 10:50 AM, "Jason" <jasonhihn at gmail.com> wrote:
>
> A pipeline can be described as a sequence of functions that are applied
> to an input, with each subsequent function getting the output of the
> preceding function:
>
> out = f6(f5(f4(f3(f2(f1(in))))))
>
> However, this isn't very readable and does not support conditionals.
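
For the readability problem, one common approach is to fold the functions over the input with functools.reduce. What follows is only a minimal sketch: `pipeline` and `when` are hypothetical helper names, not part of any library, and the string methods stand in for f1..f3. A conditional becomes just another step that either applies its function or passes the value through unchanged.

```python
from functools import reduce


def pipeline(*funcs):
    """Return a function that applies funcs to a value, left to right."""
    def run(value):
        return reduce(lambda acc, f: f(acc), funcs, value)
    return run


def when(predicate, func):
    """Hypothetical conditional step: apply func only if predicate(value) holds."""
    def step(value):
        return func(value) if predicate(value) else value
    return step


# str.strip, str.lower and str.split are stand-ins for f1, f2, f3.
process = pipeline(str.strip, when(str.isupper, str.lower), str.split)

shouted = process("  HELLO WORLD ")   # lowercased, because the text is all caps
mixed = process("  Hello World ")     # passed through the conditional untouched
```

The steps now read in the order they run, instead of inside out.
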
>
> TensorFlow has tensor-focused pipelines:
>     fc1 = layers.fully_connected(x, 256, activation_fn=tf.nn.relu,
>                                  scope='fc1')
>     fc2 = layers.fully_connected(fc1, 256, activation_fn=tf.nn.relu,
>                                  scope='fc2')
>     out = layers.fully_connected(fc2, 10, activation_fn=None, scope='out')
>
> I have some code which allows me to mimic this, but with an implied
> parameter.
>
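> import functools
> from functools import reduce  # reduce is not a builtin in Python 3
>
> def executePipeline(steps, collection_funcs=(map, filter, reduce)):
>     """Run each (func, params) step, feeding it the previous result."""
>     results = None
>     for func, params in steps:
>         if func in collection_funcs:
>             # Bind the extra params to the callback, then apply it
>             # across the previous results.
>             print(func, params[0])
>             results = func(functools.partial(params[0], *params[1:]), results)
>         else:
>             print(func)
>             if results is None:
>                 # First step: there is no previous result to pass in.
>                 results = func(*params)
>             else:
>                 # The previous results always go in as the last argument.
>                 results = func(*(params + (results,)))
>     return results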
>
> executePipeline([
>     (read_rows, (in_file,)),
>     (map, (lower_row, field)),
>     (stash_rows, ('stashed_file',)),
>     (map, (lemmatize_row, field)),
>     (vectorize_rows, (field, min_count)),
>     (evaluate_rows, (weights, None)),
>     (recombine_rows, ('stashed_file',)),
>     (write_rows, (out_file,)),
> ])
>
> Which gets me close, but I can't control where the rows get passed in. In
> the above code, they are always the last parameter.
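
One way to control the position, sketched under assumptions rather than taken from an existing library: mark the slot with a sentinel and substitute the previous result wherever the sentinel appears. `PREV` and `run_steps` are hypothetical names, and the builtins below merely stand in for read_rows, vectorize_rows and the rest.

```python
PREV = object()  # hypothetical sentinel meaning "the previous step's result goes here"


def run_steps(steps):
    """Run (func, arg, ...) tuples in order, replacing PREV with the last result."""
    result = None
    for func, *args in steps:
        # Identity comparison, so arguments with unusual == behaviour are safe.
        args = [result if arg is PREV else arg for arg in args]
        result = func(*args)
    return result


steps = [
    (range, 1, 6),                    # no PREV: this step starts the pipeline
    (map, lambda n: n * n, PREV),     # previous result as the last argument
    (filter, lambda n: n % 2, PREV),  # keep the odd squares
    (sum, PREV, 100),                 # previous result first, then a start value
]
total = run_steps(steps)
```

Steps without PREV are called with their arguments exactly as written, so the first step needs no special case. To bind extra arguments to a map or filter callback, as the original code does, functools.partial can be written directly in the step.
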
>
> I feel like I'm reinventing a wheel here.  I was wondering if there's
> already something that exists?

IBM has long had a program called Pipelines that runs on IBM mainframes.
It does what you want.

A number of attempts have been made to create cross-platform versions of
this marvelous program.

A long time ago I started, but never completed, an open source Python
version. If you are interested in taking a look at it, let me know.