Read from database, write to another database, simultaneously

Sean Davis seandavi at gmail.com
Thu Jan 11 07:27:06 EST 2007



On Jan 10, 9:27 pm, johnf <jfabi... at yolo.com> wrote:
> Bjoern Schliessmann wrote:
> > Sean Davis wrote:
>
> >> The author of one of the python database clients mentioned that
> >> using one thread to retrieve the data from the oracle database and
> >> another to insert the data into postgresql with something like a
> >> pipe between the two threads might make sense, keeping both IO
> >> streams busy.
>
> > IMHO he's wrong. Network interaction is quite slow compared with CPU
> > performance, so there's no gain (maybe even overhead due to thread
> > management and locking stuff). That's true even on multiprocessor
> > machines, not only because there's almost nothing to compute but
> > only IO traffic. CMIIW.
>
> > Using multiplexing, you'll get good results with simple code without
> > the danger of deadlocks. Have a look at asyncore (standard library)
> > or the Twisted framework -- personally, I prefer the latter.
>
> > Regards,
>
> Sean you can't win - everyone has a different idea!  You need to explain
> that oracle has millions of records and it's possible to a pipe open to
> feed the Postgres side.
>
> One thing I didn't get - is this a one time transfer or something that is
> going to happen often.

Yes, some detail about the problem is definitely in order!

We have a collaborator that is going to maintain a large genome
database that is a component of a postgresql database that we currently
maintain.  There are going to be consumers of the oracle data using
both mysql and postgresql.  The oracle database is LARGE with around
100,000,000 rows spread over some 50-70 tables in multiple schemas.
The idea is that as publicly available data (from various datasources
on the web) become available, the oracle team will update the oracle
database, doing all the parsing and necessary data cleanup of the raw
data.  We then want to be able to update postgres with these oracle
data.  So the process may be done only once per month on some tables,
but as often as once a day on others.

As for the specifics, Oracle data is going to be coming in as a DB-API
2 cursor in manageable chunks (and at a relatively slow pace).  On the
postgres loading side, I wanted to use the pscycopg2 copy_from
function, which expects an open file-like object (that has read and
readline functionality) and is quite fast for loading data.  Note the
disconnect here--Oracle is coming in in discrete chunks, while
postgresql is looking for a file object.  I solved this problem by
creating a temporary file as an intermediary, but why wait for Oracle
to finish dumping data when I can potentially be loading into postgres
at the same time that the data is coming in?  So, I am actually looking
for a solution to this problem that doesn't require an intermediate
file and allows simultaneous reading and writing, with the caveat that
the data cannot all be read into memory simultaneously, so will need to
be buffered.

I hope that clarifies things.

Thanks,
Sean




More information about the Python-list mailing list