[Tutor] Optimal solution in dealing with huge databases in python

Shadab Sayani shadabsayani at yahoo.com
Thu Jan 25 15:01:36 CET 2007


Hi Alan,
Thank you very much for the immediate response. I didn't quite
follow the point about loading the data using SQL. What does that
mean? Does it mean a bulk file load? If so, won't there be an I/O
performance hit, since I would first have to write the parsed
tokens to a file in CSV format and only after that could I bulk
load this file?
Do you have any idea about the C API for PostgreSQL, and some
documentation on how to use it?
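Just so I am sure I understand the CSV route, here is a rough
sketch of what I have in mind, assuming the psycopg2 driver and a
made-up table called bio_records (the parsing step is only a
placeholder for my real parser):

    import csv
    import io
    import psycopg2   # PostgreSQL driver (assumed to be installed)

    # Connection string and table name are made up - adjust to the real setup.
    conn = psycopg2.connect("dbname=biodb user=shadab")
    cur = conn.cursor()

    # Parse the bio datafile into rows and write them as CSV into an
    # in-memory buffer, so the CSV never has to be written to disk.
    buf = io.StringIO()
    writer = csv.writer(buf)
    with open("datafile.txt") as infile:
        for line in infile:
            tokens = line.split()        # placeholder for the real parsing
            writer.writerow(tokens)
    buf.seek(0)

    # Hand the whole buffer to PostgreSQL's COPY command in one go.
    cur.copy_expert("COPY bio_records FROM STDIN WITH CSV", buf)
    conn.commit()
    cur.close()
    conn.close()

For the real files I would of course do this in chunks of a few
hundred thousand rows at a time rather than buffering everything
in memory at once. Does that look like the right idea?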
Thanks and Regards,
Shadab.
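
PS - On your point below about indexes, is the idea something
like the following? This is only a rough sketch (the index and
column names are made up): drop the index before the bulk load
and rebuild it once afterwards, so the database does not have to
maintain it for every row inserted.

    import psycopg2   # assumed driver; all names below are illustrative

    conn = psycopg2.connect("dbname=biodb user=shadab")
    cur = conn.cursor()

    # Drop the index so it is not updated for every row during the load.
    cur.execute("DROP INDEX IF EXISTS bio_records_seq_idx")
    conn.commit()

    # ... run the bulk COPY load here, as in the sketch above ...

    # Rebuild the index once, after all the data is in place.
    cur.execute("CREATE INDEX bio_records_seq_idx ON bio_records (seq_id)")
    conn.commit()
    cur.close()
    conn.close()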
--- Alan Gauld <alan.gauld at btinternet.com> wrote:

> "Shadab Sayani" <shadabsayani at yahoo.com> wrote 
> 
> > The data I need to deal with is in the 100s of GB.
> > I am using a PostgreSQL backend and the SQLAlchemy ORM.
> 
> All ORMs will introduce a significant performance
> hit.
> If you really need high speed, and populating a
> 100G+ database 
> probably is included in that, then you should look
> at raw SQL.
> In fact for really big data loads most databases
> provide a 
> C API that goes under the SQL, because even SQL is 
> relatively slow.
> 
> As an example, we use a large Oracle database at
> work.
> Loading about 3 Terabytes of data through an ORM
> took 
> over 3 months! Loading it through SQL took about 3
> days.
> Loading it through the C API took less than a day.
> 
> Your mileage may vary since a lot depends on locks, 
> indexes etc etc. And of course the server spec!
> 
> But for loading large data volumes get as close to
> the metal 
> as possible. Once the data is loaded you can use the
> ORM to simplify the application code for extracting
> and modifying the data.
> 
> > I need  to read the bio datafiles and parse them
> and 
> > then store them in  database.
> 
> Parsing them and preparing the SQL statements can 
> be done in Python. But the actual loading I suggest 
> should be done in SQL if possible (the C API should 
> be a last resort - it's fraught with danger!)
> 
> > Please suggest some viable solution to handle such
> > enormous data from Python.
> 
> A few hundred gigabytes is not too enormous these
> days
> but you are never going to achieve times of less
> than hours.
> You do need to be realistic about that. And if you
> are using
> a standard PC spec server instead of a large multi
> CPU box with SCSI/RAID disc arrays etc then you could
> be looking at days.
> 
> The other important factor is your data schema. The
> more 
> tables, joins, indexes etc the database has to
> maintain the 
> more work it takes and the slower it gets. The 3TB
> example 
> I gave had over 2000 tables, so it was always going
> to be 
> slow. If you have a single unindexed table then it
> will be 
> much simpler. (But the queries later will be much
> harder!)
> 
> -- 
> Alan Gauld
> Author of the Learn to Program web site
> http://www.freenetpages.co.uk/hp/alan.gauld
> 
> _______________________________________________
> Tutor maillist  -  Tutor at python.org
> http://mail.python.org/mailman/listinfo/tutor
> 

