Read from database, write to another database, simultaneously

Chris Mellon arkanes at gmail.com
Thu Jan 11 13:19:54 EST 2007


On 1/11/07, Dennis Lee Bieber <wlfraed at ix.netcom.com> wrote:
> On 11 Jan 2007 04:49:21 -0800, "Sean Davis" <seandavi at gmail.com>
> declaimed the following in comp.lang.python:
>
>
> >
> > The machine running the script is distinct from the Oracle machine
> > which is distinct from the Postgresql machine.  So, CPU usage is low
>
>         Which somewhat cancels the idea of just using pipes to pass data
> directly... Unless the command-line clients can be run "locally" and
> connect to the remote servers.
>
> > buffer.  A Queue isn't quite right as it stands, as the data is coming
> > in as records, but for postgres loading, a file-like stream is what I
> > need, so there will need to be either a wrapper around the Queue on the
>
>         "need"? No... the "copy ... from ..." statement "needs" such, but do
> you really "need" to use "copy ... from ..." to load the data -- or is
> this just a concept that caught your fancy?
>
>         If  you can get Oracle to dump the data in the form of SQL insert
> statements, maybe with some clean-up done by a script, and feed them to
> PostgreSQL the entire job starts to look something like:
>
> odump theDatabase | python cleanup.py | psql theNewDatabase
>
> {note: command line tool names are made up for the example}
>
>         "copy ... from ..." appears to be designed to work with formatted
> text files... Something that might have been produced as a tabular
> report by almost any tool. As such, you are basically bypassing most of
> the DB-API capability. If Oracle has a "select" variant that can specify
> a start-offset and length (or a server-side cursor in the DB-API so the
> full data is not transferred in one "chunk"), and the PostgreSQL DB-API
> supports an "executemany" operation, the threading pseudocode I showed
> could still be used. Instead of queueing single records, one could queue
> a multiple record "chunk" (a "fetchmany" specifying 100, 500, or
> whatever, records); the data would be a list of tuples, so no concerns
> about parsing fields from a text record. Then an "executemany" would
> submit the chunk at one time.
>
> > get() side.  Or is there a better way to go about this detail?  What
> > seems to make sense to me is to stringify the incoming oracle data into
> > some kind of buffer and then read on the postgresql side.
>
>         My view: avoid using any commands that result in formatting the data
> as text reports... Use DB-API interfaces to keep the data as fields.
>
> ocrsr.execute("select * from table")
> while True:
>         batch = ocrsr.fetchmany(100)
>         if not batch: break
>         pcrsr.executemany("insert into table", batch)
>
> {this is a sequential sample -- but using threads just means the ocrsr
> operations run in one thread, each batch is relayed via a queue, and
> the pcrsr operations run in the other thread}
> --
>         Wulfraed        Dennis Lee Bieber               KD6MOG
>         wlfraed at ix.netcom.com           wulfraed at bestiaria.com
>                 HTTP://wlfraed.home.netcom.com/
>         (Bestiaria Support Staff:               web-asst at bestiaria.com)
>                 HTTP://www.bestiaria.com/
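
For reference, a minimal sketch of the threaded variant described
above; the connection strings, table names, and three-column schema
are made up, and cx_Oracle/psycopg2 are assumed as the two DB-API
drivers:

import threading
import Queue   # Python 2 spelling; the module is "queue" on Python 3

import cx_Oracle   # assumed Oracle DB-API driver
import psycopg2    # assumed PostgreSQL DB-API driver

CHUNK = 500
DONE = object()    # sentinel marking end of data

def reader(ocrsr, q):
    ocrsr.execute("select * from src_table")
    while True:
        batch = ocrsr.fetchmany(CHUNK)
        if not batch:
            break
        q.put(batch)       # relay a list of row tuples to the writer
    q.put(DONE)

def writer(pconn, pcrsr, q):
    while True:
        batch = q.get()
        if batch is DONE:
            break
        # placeholder count must match the destination table's columns
        pcrsr.executemany(
            "insert into dst_table values (%s, %s, %s)", batch)
    pconn.commit()

oconn = cx_Oracle.connect("user/pass@orahost/orcl")       # made up
pconn = psycopg2.connect("host=pghost dbname=db user=u")  # made up
q = Queue.Queue(maxsize=10)  # bounded, so the reader can't run away
t1 = threading.Thread(target=reader, args=(oconn.cursor(), q))
t2 = threading.Thread(target=writer, args=(pconn, pconn.cursor(), q))
t1.start(); t2.start()
t1.join(); t2.join()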


Using the DB-API for this is the wrong approach. COPY FROM, which is
the PostgreSQL equivalent of Oracle's SQL*Loader, is orders of
magnitude faster than sequential inserts, and it's really what you
want for bulk data loads. What you're trying to do is read and write
the file at the same time, across a network interface; that's just
asking for trouble. In particular, I doubt that postgres expects to
have something writing to the end of the file it is loading, and it
will probably error out if it hits an unexpected EOF, which is what
will happen if the writer lags behind.
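
If the Oracle side can dump tab-separated text, the PostgreSQL load
side is a single COPY. A sketch using psycopg2's copy_from (the
connection string, file name, and table name are made up):

import psycopg2   # assumed PostgreSQL driver

pconn = psycopg2.connect("host=pghost dbname=db user=u")  # made up
pcrsr = pconn.cursor()
f = open("dump.tsv")        # tab-separated dump produced beforehand
pcrsr.copy_from(f, "dst_table", sep="\t")  # one server-side COPY
f.close()
pconn.commit()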

Just dump the files from Oracle, copy them, and load them. It'll
probably be faster in the long run once you count all the time you'd
lose restarting the process and troubleshooting your "optimization".


