custom data warehouse in python vs. out-of-the-box ETL tool

snfctech tschmidt at sacfoodcoop.com
Tue Sep 22 20:23:24 EDT 2009


Thanks for your replies, Sean and Martin.

I agree that the ETL tools are complex in themselves, and I may as
well spend that learning curve on a lower-level tool-set that has the
added value of greater flexibility.

Can you suggest a good book or tutorial to help me build a data
warehouse in python?  Bill Inmon's "Building the Data Warehouse" is 17
years old, and I've been cautioned against Kimball.

Thanks.


On Sep 22, 3:40 pm, "Martin P. Hellwig" <martin.hell... at dcuktec.org>
wrote:
> snfctech wrote:
> > Does anyone have experience building a data warehouse in python?  Any
> > thoughts on custom vs using an out-of-the-box product like Talend or
> > Informatica?
>
> > I have an integrated system Dashboard project that I was going to
> > build using cross-vendor joins on existing DBs, but I keep hearing
> > that a data warehouse is the way to go.  e.g. I want to create orders
> > and order_items with relations to members (MS Access DB), products
> > (flat file) and employees (MySQL).
>
> > Thanks in advance for any tips.
>
> My experience is that if you enjoy hacking around databases and are
> proficient in Python, than for small scale solutions it is preferable to
> do it yourself. If you need a large scale solutions with advanced
> requirements, building it yourself is mostly the only way.
>
> I have build a rather complex datawarehouse system in the past (well
> actually more like a centralised synchronisation hub, having input and
> reporting databases as satellite clients), shoving data from around 500
> databases (typically 5 Gb in size each) spread over the world.
>
> The only commercial solutions I reviewed was Business Objects Data
> Integrator and Oracle Warehouse Builder.
>
> These tools where quite flexible and if you already have a license deal
> which includes these tools I would definitely recommend to have more
> than just a look at it.
>
> If not and you are comfortably with using python to shovel data from A
> to B and transform it at the same time (moving relational data
> automatically into a EAV model and back again, for example) than
> building your own solution will probably save you money and time (as
> opposed to learn how to use that ETL tool).
>
> This will require you to have at least interest in the following subjects:
> - Authorization, may everybody use all data or should it be limited to a
> subset on the data depending on the data?
> (My solution was one centralised hub which contains all data but is only
> accessible to special 'client' servers strictly maintained by me which
> only sync the data relevant to them).
>
> - Authenticity, if you have different values for the same thing, which
> one should be considered authoritative and if yes may it be pushed back
> to the un-authoritative?
>
> -Synchronisation, you really don't want to push/pull all of the database
> content over every x times, so how can you delta it and is there a way
> to do this only when the data changes (push vs pull)?
>
> -ATOMIC, how long may the data be out of date and is it allowed to
> partially update
>
> -Using and maintaining multiple databases, hopefully spread over
> multiple systems. I had a server for each production DB, a server that
> mirrored that production DB with some added columns per table for
> external synchronization purposes and a master synchronisation server
> (so in essence all data was copied three times, not very efficient but
> good if you like to play it on the safe side).
>
> --
> MPHhttp://blog.dcuktec.com
> 'If consumed, best digested with added seasoning to own preference.'




More information about the Python-list mailing list