[Tutor] Optimal solution in dealing with huge databases in python

johnf jfabiani at yolo.com
Fri Jan 26 01:39:14 CET 2007


On Thursday 25 January 2007 14:36, Eric Walstad wrote:
> Alan Gauld wrote:
> > "Shadab Sayani" <shadabsayani at yahoo.com> wrote
> >
> >> The data I need to deal with is  in 100s of GB.
> >> I am using postgresql backend and SQLALCHEMY ORM.
> >
> > All ORMs will introduce a significant performance hit.
> > If you really need high speed, and populating a 100G+ database
> > probably is included in that, then you should look at raw SQL.
> > In fact for really big data loads most databases provide a
> > C api that goes under the SQL, because even SQL is
> > relatively slow.
>
> I've had this experience, too.  I routinely load a couple million
> records from a tab delimited file into our PostgreSQL database with
> Python.  My import script:
>  1. iterates over the tab file a line/record at a time
>  2. does some data validation, some data manipulation
>  3. writes an SQL file with the massaged data
>  4. calls psql <dbname> -f <python-generated sql file>
>
> Generating the SQL file in python goes amazingly fast, considering the
> amount of validation and manipulation (normalizing addresses, generating
> hashes, etc) - about 1.5minutes/million records.
>
> The SQL it generates does the usual DB stuff including dropping indexes
> and constraints, COPY FROM stdin, regenerate indexes, replace
> constraints.  In my experience, psql is MUCH faster at these kinds of
> imports than even direct python interaction with psycopg.  Adding an ORM
> on top of the database connector (psycopg in my case) slows things down
> even more.
>
> As a rough metric, the import I described takes a little over 2 hours on
> my 2GHz/1GB laptop.
>
> Good luck,
>
> Eric.
> _______________________________________________
> Tutor maillist  -  Tutor at python.org
> http://mail.python.org/mailman/listinfo/tutor

I agree - never thought of automating psql for copy...from action.  But it can 
be done.
-- 
John Fabiani


More information about the Tutor mailing list