custom data warehouse in python vs. out-of-the-box ETL tool

Martin P. Hellwig martin.hellwig at dcuktec.org
Tue Sep 22 18:40:15 EDT 2009


snfctech wrote:
> Does anyone have experience building a data warehouse in python?  Any
> thoughts on custom vs using an out-of-the-box product like Talend or
> Informatica?
> 
> I have an integrated system Dashboard project that I was going to
> build using cross-vendor joins on existing DBs, but I keep hearing
> that a data warehouse is the way to go.  e.g. I want to create orders
> and order_items with relations to members (MS Access DB), products
> (flat file) and employees (MySQL).
> 
> Thanks in advance for any tips.

My experience is that if you enjoy hacking around databases and are 
proficient in Python, than for small scale solutions it is preferable to 
do it yourself. If you need a large scale solutions with advanced 
requirements, building it yourself is mostly the only way.

I have build a rather complex datawarehouse system in the past (well 
actually more like a centralised synchronisation hub, having input and 
reporting databases as satellite clients), shoving data from around 500 
databases (typically 5 Gb in size each) spread over the world.

The only commercial solutions I reviewed was Business Objects Data 
Integrator and Oracle Warehouse Builder.

These tools where quite flexible and if you already have a license deal 
which includes these tools I would definitely recommend to have more 
than just a look at it.

If not and you are comfortably with using python to shovel data from A 
to B and transform it at the same time (moving relational data 
automatically into a EAV model and back again, for example) than 
building your own solution will probably save you money and time (as 
opposed to learn how to use that ETL tool).

This will require you to have at least interest in the following subjects:
- Authorization, may everybody use all data or should it be limited to a 
subset on the data depending on the data?
(My solution was one centralised hub which contains all data but is only 
accessible to special 'client' servers strictly maintained by me which 
only sync the data relevant to them).

- Authenticity, if you have different values for the same thing, which 
one should be considered authoritative and if yes may it be pushed back 
to the un-authoritative?

-Synchronisation, you really don't want to push/pull all of the database 
content over every x times, so how can you delta it and is there a way 
to do this only when the data changes (push vs pull)?

-ATOMIC, how long may the data be out of date and is it allowed to 
partially update

-Using and maintaining multiple databases, hopefully spread over 
multiple systems. I had a server for each production DB, a server that 
mirrored that production DB with some added columns per table for 
external synchronization purposes and a master synchronisation server 
(so in essence all data was copied three times, not very efficient but 
good if you like to play it on the safe side).


-- 
MPH
http://blog.dcuktec.com
'If consumed, best digested with added seasoning to own preference.'



More information about the Python-list mailing list