A data transformation framework. A presentation inviting commentary.

Terry Reedy tjreedy at udel.edu
Wed Aug 21 16:00:50 EDT 2013


On 8/21/2013 12:29 PM, F.R. wrote:
> Hi all,
>
> In an effort to do some serious cleaning up of a hopelessly cluttered
> working environment, I developed a modular data transformation system
> that pretty much stands. I am very pleased with it. I expect huge time
> savings. I would share it, if I had a sense that there is an interest out
> there and would appreciate comments. Here's a description. I named the
> module TX:

You appear to have developed a framework for creating data flow 
networks. Others exist, including Python itself and things built on top 
of Python, like yours. I am not familiar with the others built on 
Python, but I would not be surprised if yours occupies its own niche. It 
is easy enough to share on PyPI.

> The nucleus of the TX system is a Transformer class, a wrapper for any
> kind of transformation functionality. The Transformer takes input as
> calling argument and returns it transformed. This design allows the
> assembly of transformation chains, either nesting calls or, better, using
> the class Chain, derived from 'Transformer' and 'list'.

Python 3 is built around iterables and iterators. Iterables generalize 
the notion of list to any structure that can be sequentially accessed. A 
collection can be either concrete, existing all at once in some memory, 
or abstract, with members created as needed.

One can think of there being two types of iterator. One merely presents 
the items of a collection one at a time. The other transforms items one 
at a time.
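
For instance (a minimal sketch; the names are invented for illustration):

# A source iterator: presents the items of a concrete list one at a time.
data = [1, 2, 3, 4]
source = iter(data)

# A transforming iterator: consumes any iterable and yields each item
# transformed, one at a time.  A generator function is the usual idiom.
def doubled(items):
    for item in items:
        yield item * 2

print(list(doubled(source)))  # [2, 4, 6, 8]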

The advantage of 'lazy' collections is that they scale up much better 
to processing, say, a billion items. If your framework keeps the input 
list and all intermediate lists, as you seem to say, then your framework 
is memory constrained. Python (mostly) shifted from lists to iterables 
as the common data interchange type partly for this reason.
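
A sketch of the lazy style with plain stdlib Python (nothing here is 
your API):

import itertools

# An abstract collection of a billion items; nothing is materialized.
big = range(10**9)

# Each stage transforms items one at a time; no intermediate list is kept.
squares = (n * n for n in big)
evens = (n for n in squares if n % 2 == 0)

# Only the first few results are ever computed.
print(list(itertools.islice(evens, 5)))  # [0, 4, 16, 36, 64]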

You are right that keeping data around can help debugging. Without that, 
each iterator must be properly tested if its operation is not transparent.

 > A Chain consists
> of a sequence of Transformers and is functionally equivalent to an
> individual Transformer. A high degree of modularity results: Chains
> nest.

Because iterators are also iterables, they nest. A transformer iterator 
does not care if its input is a concrete non-iterator iterable, a source 
iterator representing an abstract collection, or another transformer.
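
For example (generator names invented for illustration):

def stripped(lines):
    for line in lines:
        yield line.strip()

def nonblank(lines):
    for line in lines:
        if line:
            yield line

# The same chain accepts a concrete list, a plain iterator over it, or a
# file object; the stages compose without caring what feeds them.
raw = ["  spam \n", "\n", " eggs\n"]
print(list(nonblank(stripped(raw))))        # ['spam', 'eggs']
print(list(nonblank(stripped(iter(raw)))))  # ['spam', 'eggs']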

> Another consequence is that many transformation tasks can be
> handled with a relatively modest library of a few basic prefabricated
> Transformers from which many different Chains can be assembled on the
> fly.

This is precisely the idea of the itertools module. I suspect that 
itertools.tee is equivalent to Tx.split (from the deleted code). 
Application areas need more specialized iterators. There are many in 
various stdlib modules.
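
I have not seen the deleted code, so this is only a guess at the idea, 
but the tee idiom looks roughly like:

import itertools

numbers = iter(range(10))

# tee splits one stream into two independent iterators over the same items.
a, b = itertools.tee(numbers)

doubles = (n * 2 for n in a)
squares = (n * n for n in b)

print(list(itertools.islice(doubles, 3)))  # [0, 2, 4]
print(list(itertools.islice(squares, 3)))  # [0, 1, 4]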

> A custom Transformer to bridge an eventual gap is quickly written
> and tested, because the task likely is trivial.
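
The same holds for iterators: a custom transformer is usually a few 
lines of generator and can be tested in isolation on a tiny concrete 
input. A sketch, with an invented example task:

def as_floats(items):
    # Hypothetical gap-bridging transformer: convert text fields to floats.
    for item in items:
        yield float(item)

assert list(as_floats(["1.5", "2", "3.25"])) == [1.5, 2.0, 3.25]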

-- 
Terry Jan Reedy



