custom data warehouse in python vs. out-of-the-box ETL tool

nn pruebauno at latinmail.com
Wed Sep 23 11:45:45 EDT 2009


On Sep 22, 4:00 pm, snfctech <tschm... at sacfoodcoop.com> wrote:
> Does anyone have experience building a data warehouse in python?  Any
> thoughts on custom vs using an out-of-the-box product like Talend or
> Informatica?
>
> I have an integrated system Dashboard project that I was going to
> build using cross-vendor joins on existing DBs, but I keep hearing
> that a data warehouse is the way to go.  e.g. I want to create orders
> and order_items with relations to members (MS Access DB), products
> (flat file) and employees (MySQL).
>
> Thanks in advance for any tips.

I use both Python and a data warehouse tool from IBM (DataStage) that
is similar to Informatica. The main difference from Python is
throughput. The tool's sort and join routines are multithreaded C code
that can handle data sets too big to fit in RAM. It also has good
native drivers for the DB2 database. For data conversions and other
transformations, the rows get spread across the available CPUs, so you
can really put a 16-core machine to good use with this thing.
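
To give an idea of what sorting more data than fits in RAM involves,
here is a minimal single-threaded sketch of an external merge sort in
plain Python. It is not how DataStage does it; the function names and
the chunk_size parameter are made up for illustration, and it assumes
every input line ends with a newline.

```python
import heapq
import os
import tempfile


def _write_run(sorted_lines):
    """Write one sorted chunk to a temporary file and return its path."""
    fd, path = tempfile.mkstemp(suffix=".run", text=True)
    with os.fdopen(fd, "w") as handle:
        handle.writelines(sorted_lines)
    return path


def _read_run(path):
    """Yield the lines of one run file, closing it when exhausted."""
    with open(path) as handle:
        for line in handle:
            yield line


def external_sort(lines, chunk_size=100_000):
    """Yield lines in sorted order, buffering at most chunk_size lines.

    Assumption: every line is newline-terminated, so the runs can be
    written and read back unchanged.
    """
    run_paths = []
    chunk = []
    try:
        for line in lines:
            chunk.append(line)
            if len(chunk) >= chunk_size:
                # Sort the chunk in memory, then spill it to disk.
                chunk.sort()
                run_paths.append(_write_run(chunk))
                chunk = []
        if chunk:
            chunk.sort()
            run_paths.append(_write_run(chunk))
        # heapq.merge reads the sorted runs lazily, so only one
        # pending line per run is held in memory during the merge.
        yield from heapq.merge(*(_read_run(p) for p in run_paths))
    finally:
        for path in run_paths:
            os.remove(path)
```

Something like list(external_sort(open("big.txt"), chunk_size=1000000))
would drive it, with "big.txt" standing in for your own file. The
commercial tools do the same kind of thing in compiled code across
several threads, which is where the throughput gap comes from.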

In your case you probably won't have enough data to justify the cost
of buying a tool. They are quite expensive.
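
For a small data set, one way to go custom is to copy each source
into a single staging database and do the joins there. The following
is only a sketch under assumptions: it uses SQLite from the standard
library as the staging store, and the table layouts, column names and
sample rows are hypothetical stand-ins for your Access, flat file and
MySQL sources.

```python
import csv
import io
import sqlite3

# Hypothetical sample data; in practice each would be extracted
# from its real source (Access, flat file, MySQL).
PRODUCTS_FLAT_FILE = "sku,name,price\nP1,Granola,4.50\nP2,Olive oil,9.00\n"
MEMBERS = [(1, "Ana"), (2, "Ben")]
EMPLOYEES = [(10, "Cy")]
ORDERS = [(100, 1, 10), (101, 2, 10)]  # order_id, member_id, employee_id
ORDER_ITEMS = [(100, "P1", 2), (100, "P2", 1), (101, "P1", 1)]


def build_warehouse(conn):
    """Create the staging tables and load every source into them."""
    conn.executescript(
        """
        CREATE TABLE members (member_id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE employees (employee_id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE products (sku TEXT PRIMARY KEY, name TEXT, price REAL);
        CREATE TABLE orders (order_id INTEGER PRIMARY KEY,
                             member_id INTEGER, employee_id INTEGER);
        CREATE TABLE order_items (order_id INTEGER, sku TEXT,
                                  quantity INTEGER);
        """
    )
    conn.executemany("INSERT INTO members VALUES (?, ?)", MEMBERS)
    conn.executemany("INSERT INTO employees VALUES (?, ?)", EMPLOYEES)
    # Parse the flat file with the csv module and convert the price.
    reader = csv.DictReader(io.StringIO(PRODUCTS_FLAT_FILE))
    conn.executemany(
        "INSERT INTO products VALUES (?, ?, ?)",
        ((row["sku"], row["name"], float(row["price"])) for row in reader),
    )
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", ORDERS)
    conn.executemany("INSERT INTO order_items VALUES (?, ?, ?)", ORDER_ITEMS)
    conn.commit()


def order_totals(conn):
    """Return (order_id, member, employee, total) for each order."""
    return conn.execute(
        """
        SELECT o.order_id, m.name, e.name, SUM(oi.quantity * p.price)
        FROM orders o
        JOIN members m ON m.member_id = o.member_id
        JOIN employees e ON e.employee_id = o.employee_id
        JOIN order_items oi ON oi.order_id = o.order_id
        JOIN products p ON p.sku = oi.sku
        GROUP BY o.order_id, m.name, e.name
        ORDER BY o.order_id
        """
    ).fetchall()
```

Call build_warehouse(sqlite3.connect(":memory:")) and then
order_totals on the same connection to try it. Once everything sits
in one database the cross-source joins are ordinary SQL, and the
Python side is reduced to the extract and load steps.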


