General Purpose Pipeline library?

Friedrich Rentsch anthra.norell at bluewin.ch
Wed Nov 22 05:42:20 EST 2017



On 11/22/2017 10:54 AM, Friedrich Rentsch wrote:
>
>
> On 11/21/2017 03:26 PM, Jason wrote:
>> On Monday, November 20, 2017 at 10:49:01 AM UTC-5, Jason wrote:
>>> a pipeline can be described as a sequence of functions that are 
>>> applied to an input with each subsequent function getting the output 
>>> of the preceding function:
>>>
>>> out = f6(f5(f4(f3(f2(f1(in))))))
>>>
>>> However this isn't very readable and does not support conditionals.
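>>>
>>> For comparison, the nested form can be flattened into a left-to-right
>>> composition helper. A minimal sketch, assuming each step is a plain
>>> one-argument callable:
>>>
>>>     import functools
>>>
>>>     def pipeline(*funcs):
>>>         """Return a function that applies funcs left to right."""
>>>         def run(value):
>>>             return functools.reduce(lambda acc, f: f(acc), funcs, value)
>>>         return run
>>>
>>>     # out = pipeline(f1, f2, f3, f4, f5, f6)(in)
>>>
>>> This reads in execution order but, as noted, still leaves no room for
>>> conditionals.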
>>>
>>> Tensorflow has tensor-focused pipelines:
>>>      fc1 = layers.fully_connected(x, 256, activation_fn=tf.nn.relu, scope='fc1')
>>>      fc2 = layers.fully_connected(fc1, 256, activation_fn=tf.nn.relu, scope='fc2')
>>>      out = layers.fully_connected(fc2, 10, activation_fn=None, scope='out')
>>>
>>> I have some code which allows me to mimic this, but with an implied 
>>> parameter.
>>>
>>> import functools
>>> from functools import reduce   # reduce is not a builtin in Python 3
>>>
>>> def executePipeline(steps, collection_funcs = [map, filter, reduce]):
>>>     results = None
>>>     for step in steps:
>>>         func = step[0]
>>>         params = step[1]
>>>         if func in collection_funcs:
>>>             # map/filter/reduce: bind the extra arguments and feed the
>>>             # previous results in as the iterable
>>>             print(func, params[0])
>>>             results = func(functools.partial(params[0], *params[1:]), results)
>>>         else:
>>>             print(func)
>>>             if results is None:
>>>                 results = func(*params)
>>>             else:
>>>                 # previous results are always appended as the last argument
>>>                 results = func(*(params + (results,)))
>>>     return results
>>>
>>> executePipeline( [
>>>                 (read_rows, (in_file,)),
>>>                 (map, (lower_row, field)),
>>>                 (stash_rows, ('stashed_file', )),
>>>                 (map, (lemmatize_row, field)),
>>>                 (vectorize_rows, (field, min_count,)),
>>>                 (evaluate_rows, (weights, None)),
>>>                 (recombine_rows, ('stashed_file', )),
>>>                 (write_rows, (out_file,))
>>>             ]
>>> )
>>>
>>> Which gets me close, but I can't control where the rows get passed in. 
>>> In the above code, they are always the last parameter.
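>>>
>>> One way around that would be to mark the injection point with a sentinel
>>> placeholder in the parameter tuple (just a sketch; RESULTS and apply_step
>>> are illustrative names, not an existing library):
>>>
>>>     RESULTS = object()   # placeholder marking where the results should go
>>>
>>>     def apply_step(func, params, results):
>>>         if any(p is RESULTS for p in params):
>>>             args = tuple(results if p is RESULTS else p for p in params)
>>>         else:
>>>             args = params + (results,)   # default: results go last
>>>         return func(*args)
>>>
>>>     # e.g. (evaluate_rows, (weights, RESULTS, None)) injects the rows
>>>     # in the middle instead of at the end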
>>>
>>> I feel like I'm reinventing a wheel here.  I was wondering if 
>>> there's already something that exists?
>> Why do I want this? Because I'm tired of writing code that is locked 
>> away in a bespoke function. I'd have an army of functions, all 
>> slightly different in functionality. I require flexibility in 
>> defining pipelines, and I don't want a custom pipeline to require any 
>> low-level coding. I just want to feed a sequence of functions to a 
>> script and have it run them: a middle ground between the shell | 
>> operator and bespoke Python code. Sure, I could write many binaries 
>> bound by shell, but some things are done far more easily in Python 
>> because of its extensive libraries, and state can live in memory 
>> throughout the execution of the pipeline, whereas any temporary 
>> persistence in the shell has to be through environment variables or files.
>>
>> Well, after examining your feedback, it looks like Grapevine has 99% 
>> of the concepts that I wanted to invent, even if the | operator seems 
>> a bit clunky. I personally prefer the fluent interface convention. 
>> But this should work.
>>
>> Kamaelia could also work, but it seems a little bit more grandiose.
>>
>>
>> Thanks everyone who chimed in!
>
> This looks very much like what I have been working on of late: a 
> generic processing paradigm based on chainable building blocks. I call 
> them Workshops, because the base class can be thought of as a workshop 
> that takes some raw material, processes it and delivers the product 
> (to the next in line). Your example might look something like this:
>
>     >>> import workshops as WS
>
>     >>> Vectorizer = WS.Chain (
>             WS.File_Reader (),        # WS provides
>             WS.Map (lower_row),       # WS provides (wrapped builtin)
>             Row_Stasher (),           # You provide
>             WS.Map (lemmatize_row),   # WS provides
>             Row_Vectorizer (),        # Yours
>             Row_Evaluator (),         # Yours
>             Row_Recombiner (),
>             WS.File_Writer (),
>             _name = 'Vectorizer'
>         )
>
>     Parameters are process-control settings that travel through a 
> subscription-based mailing system separate from the payload pipe.
>
>     >>> Vectorizer.post (min_count = ...,  )    # Set all parameters that control the entire run.
>     >>> Vectorizer.post ("File_Writer", file_name = 'output_file_name')    # Addressed, not meant for File_Reader
>
>     Run
>
>     >>> Vectorizer ('input_file_name')    # File Writer returns 0 if the Chain completes successfully.
>     0
>
>     If you would provide a list of your functions (input, output, 
> parameters) I'd be happy to show a functioning solution. Writing a 
> Shop follows a simple standard pattern: naming the subscriptions, if 
> any, and writing a single method that reads the subscribed parameters, 
> if any, then takes the payload, processes it and returns the product.
>
>     I intend to share the system, provided there's an interest. I'd 
> have to tidy it up quite a bit, though, before daring to release it.
>
>     There's a lot more to it . . .
>
> Frederic
>
I'm sorry, I made a mistake with the "From" item. My address is 
obviously not "python-list". It is "anthra.norell at bluewin.ch".

Frederic



