python and very large data sets???

Aahz aahz at pythoncraft.com
Wed Apr 24 12:58:29 EDT 2002


In article <ad381f5b.0204240841.52333a3e at posting.google.com>,
Rad <zaka07 at hotmail.com> wrote:
>
>I still haven't received the above-mentioned files, so I can't test the
>time needed to (for example) read a 15GB "file1", filter by a few
>variables, and write the resulting subset as "sub_file1".  Things
>would get more complicated afterwards, because I will have to pull out
>IDs from "sub_file1", remove duplicate IDs to create
>"no_dup_sub_file1", match those to IDs in the remaining 3 main files, and
>pull out the data linked with those IDs.
>
>I have a few weeks to prepare myself before the data arrives, and my
>question is: am I going about the project the right way?  Is Python
>(with humanly written code) capable of doing this kind of stuff
>relatively quickly?

Python *can* handle this kind of task, but you'll be much better off if
you interface Python to a database.  The problem is that even 15GB (not
even talking about 80GB) is simply too big to fit in RAM, so you'll need
a way to process partial sets that do fit in RAM.  That's precisely what
a database is designed to do.  Python can help you transform the data
into formats that will fit better into the database, and Python can
drive the operation of the database, but you're going to need to learn
SQL to manage the actual database operation.
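
As a rough illustration of Python feeding a database without ever
holding the whole file in RAM, here is a minimal sketch.  It assumes a
comma-separated input whose first two columns are an ID and a value, and
it uses the sqlite3 module purely as a stand-in for whatever database
you end up with; the function name, table name, columns, and filter are
all hypothetical.

```python
import csv
import sqlite3


def load_filtered(csv_path, db_path, keep_row, batch_size=10000):
    """Stream csv_path one row at a time and insert matching rows in batches.

    Only one batch of rows is held in memory at any moment, so the input
    file can be far larger than RAM.  Returns the number of rows inserted.
    """
    conn = sqlite3.connect(db_path)
    # Hypothetical layout: the real files' columns are not known here.
    conn.execute("CREATE TABLE IF NOT EXISTS sub_file1 (id TEXT, value TEXT)")
    insert_sql = "INSERT INTO sub_file1 (id, value) VALUES (?, ?)"
    batch = []
    total = 0
    with open(csv_path, newline="") as f:
        for row in csv.reader(f):
            if keep_row(row):
                batch.append((row[0], row[1]))
                if len(batch) >= batch_size:
                    conn.executemany(insert_sql, batch)
                    total += len(batch)
                    batch = []
    if batch:
        # Flush the final partial batch.
        conn.executemany(insert_sql, batch)
        total += len(batch)
    conn.commit()
    conn.close()
    return total
```

You would call it as something like
load_filtered("file1", "work.db", lambda row: row[1] == "x"), with the
lambda replaced by whatever test your filter variables require.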

Python is certainly better than C for this; depending on the complexity
of the data and operations, you may be able to do this task entirely
within the database, skipping Python completely.
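
To give a feel for how much of this the database can do on its own, the
sketch below removes duplicate IDs and matches them against a second
table using nothing but SQL.  Python only drives it.  The tables,
columns, and sample rows are made up for illustration, and sqlite3 is
again just an assumption, not a recommendation of a particular database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sub_file1 (id TEXT, value TEXT)")
conn.execute("CREATE TABLE file2 (id TEXT, data TEXT)")

# Tiny made-up sample; the real tables would be loaded from the big files.
conn.executemany(
    "INSERT INTO sub_file1 VALUES (?, ?)",
    [("a", "x"), ("a", "y"), ("b", "z")],
)
conn.executemany(
    "INSERT INTO file2 VALUES (?, ?)",
    [("a", "d1"), ("c", "d2"), ("b", "d3")],
)

# Remove duplicate IDs entirely inside the database.
conn.execute(
    "CREATE TABLE no_dup_sub_file1 AS SELECT DISTINCT id FROM sub_file1"
)

# An index on the join column keeps the match from scanning file2 repeatedly.
conn.execute("CREATE INDEX idx_file2_id ON file2 (id)")

# Pull out the data in file2 linked with the de-duplicated IDs.
matched = conn.execute(
    "SELECT f.id, f.data "
    "FROM file2 AS f JOIN no_dup_sub_file1 AS n ON f.id = n.id "
    "ORDER BY f.id"
).fetchall()

for row in matched:
    print(row)
```

The same two statements (SELECT DISTINCT and a JOIN) would be repeated
for each of the remaining main files.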

I strongly suggest that you push *VERY* *HARD* to get some small sample
files (100MB to 1GB range).  Get those samples ASAP.

Finally, make sure that you have at least four times as much disk space
as the total size of all the files (that's probably a conservative
guess, but you'll definitely need at least 2.5 times).
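
If you want to check that before starting, something along these lines
would do.  It is only a sketch: the function name is made up, and the
factor of four is simply the guess above, not a measured requirement.

```python
import os
import shutil


def enough_disk_space(paths, work_dir, factor=4.0):
    """Compare free space under work_dir with factor * total input size.

    Returns a tuple (ok, total_bytes, free_bytes).
    """
    total = sum(os.path.getsize(p) for p in paths)
    free = shutil.disk_usage(work_dir).free
    return free >= factor * total, total, free
```

Pass factor=2.5 to test against the bare minimum instead.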
-- 
Aahz (aahz at pythoncraft.com)           <*>         http://www.pythoncraft.com/

What if there were no rhetorical questions?


