[Tutor] Optimal solution in dealing with huge databases in Python

johnf jfabiani at yolo.com
Fri Jan 26 15:20:18 CET 2007


On Friday 26 January 2007 00:40, Alan Gauld wrote:
> "Shadab Sayani" <shadabsayani at yahoo.com> wrote
>
> > I got your point. But before inserting data I need to store it
> > into a file in a format supported by PostgreSQL. Won't this
> > operation incur a performance hit, as it includes writing
> > to a file which is on disk?
>
> Unless your data is already in a format the database
> understands you will have to reformat it before loading it.
> There are basically two options:
> 1) read the unformatted data piece by piece, reformat
> it and load it to the database item by item.
> 2) read the unformatted data and write it to an
> intermediate file in a format supported by the
> database, then load the formatted data in bulk.
>
> The second approach is nearly always faster than
> the first for large data sets. This is due to many things
> including transactional overheads in the first approach,
> caching issues, availability of bulk optimisations in
> the database itself, etc.
>
> Writing to a flat file is much faster than writing to a
> database. Reformatting data is a complex business.
> Python is good at complex processing and writing
> to flat files. SQL is good at writing to databases but poor
> at complex processing. So use Python for its
> strengths and SQL for its strengths and you get
> optimal results.
>
> HTH,
>
> Alan G

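To make Alan's second approach concrete, here is a minimal sketch of the
reformat-then-bulk-load pattern. It assumes psycopg2 and tab-separated
output; the file names, table name, and the comma-split parsing are only
placeholders for whatever your raw format actually needs:

import psycopg2

# Step 1: read the raw data, reformat it in Python, and write a
# tab-separated flat file (placeholder parsing shown).
src = open('raw_input.txt')
dst = open('formatted.tsv', 'w')
for line in src:
    fields = line.split(',')
    dst.write("\t".join(f.strip() for f in fields) + "\n")
dst.close()
src.close()

# Step 2: hand the whole file to PostgreSQL's COPY in one shot.
conn = psycopg2.connect('dbname=mydb')
cur = conn.cursor()
cur.copy_from(open('formatted.tsv'), 'my_table')
conn.commit()
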
This subject has come up several times in the psycopg mailing list.  The 
author of psycopg (Federico Di Gregorio) has suggested that the best way 
to handle this is as follows:
"When I'd have some more time I'll write a
class to do that without the need to create a temporary file. Now that I
think about it probably the best way would be to use two threads
writing/reading from a pipe, so that the postgresql one doesn't have to
wait for the other and data is always ready." 
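
A rough sketch of that pipe idea (my own code, not Federico's, assuming a
DB-API source cursor and a psycopg cursor with copy_from()):

import os
import threading

def copy_via_pipe(src_cursor, pg_cursor, table):
    r_fd, w_fd = os.pipe()
    reader = os.fdopen(r_fd, 'r')
    writer = os.fdopen(w_fd, 'w')

    def produce():
        # Writer thread: stream rows into the pipe as
        # tab-separated lines, batch by batch.
        rows = src_cursor.fetchmany()
        while rows:
            for row in rows:
                writer.write("\t".join(map(str, row)) + "\n")
            rows = src_cursor.fetchmany()
        writer.close()                  # EOF tells copy_from() we are done

    t = threading.Thread(target=produce)
    t.start()
    pg_cursor.copy_from(reader, table)  # reads from the pipe as data arrives
    t.join()
    reader.close()

This avoids the temporary file entirely, at the cost of a second thread.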

That was in response to the following workaround, which is a cop-out and 
uses temporary files, but does the trick:

Usage is pretty simple:

orac = ora.cursor()                       # cursor on the source (Oracle) connection
orac.execute('select * from gene.geneinfo')
orac.arraysize = 100000                   # fetchmany() batch size
oraf = CursorFile(orac)                   # wrap the cursor in a file-like object
pgc = pg.cursor()                         # cursor on the target (PostgreSQL) connection
pgc.copy_from(oraf, 'loader.gene_info')   # bulk-load via COPY


------------------------------------------------
import tempfile

class CursorFile(object):
    """Wrap a DB-API cursor in a read-only file-like object
    that psycopg's copy_from() can consume."""

    def __init__(self, cursor):
        self._cursor = cursor
        self._file = tempfile.TemporaryFile()
        self.write_file()

    def write_file(self):
        # Spool the whole result set into the temporary file as
        # tab-separated lines, then rewind it for reading.
        print 'fetching'
        dat = self._cursor.fetchmany()
        while len(dat) > 0:
            print 'writing'
            for row in dat:
                self._file.write("\t".join(map(str, row)) + "\n")
            print 'fetching'
            dat = self._cursor.fetchmany()
        print 'done writing'
        self._file.flush()
        self._file.seek(0)

    # copy_from() only calls read() and readline(); delegate both
    # to the underlying temporary file.
    def read(self, size=-1):
        return self._file.read(size)

    def readline(self, size=-1):
        return self._file.readline(size)

I hope this helps
-- 
John Fabiani

