python and very large data sets???

John Machin sjmachin at lexicon.net
Wed Apr 24 20:49:49 EDT 2002


[Rad (zaka07 at hotmail.com)]

I am preparing myself to work on extracting data from 4 text files
(fixed width format) whose combined size is about 80GB.  Considering
deadlines, costs, and my limited programming knowledge I thought using
Python/Windows for the job would be the best option for me.  However,
I am worried about the speed with which Python (me and my hardware) will
be able to deal with these massive data sets, but I am hoping that this
is still a quicker route than learning C.
I still haven't received the above-mentioned files so I can't test the
time needed to (for example) read a 15GB "file1", filter by a few
variables, and write the resulting subset as "sub_file1".  Things
would afterwards get more complicated because I will have to pull out
IDs from "sub_file1", remove duplicate IDs to create
"no_dup_sub_file1", match those to IDs in the remaining 3 main files, and
pull out the data linked with those IDs.

JM-> [JM == John Machin] Here is the short answer:
Somebody who can afford to accumulate 80GB of data should be able to
afford an IT professional to work on it. IMHO (80GB of unknown data) +
Windows + SQL + mmap + grep + awk + newbie - Python produces a high
potential-disaster rating.

However in case you are stuck with it, here are some comments on the
details:

[Aahz (aahz at pythoncraft.com)]

Python *can* handle this kind of task, but you'll be much better off
if you interface Python to a database.  The problem is that even 15GB
(not even talking about 80GB) is simply too big to fit in RAM, so
you'll need a way to process partial sets that do fit in RAM.

JM-> 15GB of *fixed width* records (presumably lots of trailing spaces
and leading zeroes) could shrink considerably when only the relevant
info is considered.
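
For example, a sketch only (the column offsets, field names, and the
filter condition below are invented; substitute the real record layout):

# Keep only the relevant fields of each fixed-width record and apply a
# simple filter; the output can be a small fraction of the input size.
ifile = open("file1.txt", "r")
ofile = open("sub_file1.txt", "w")
for line in ifile:
    rec_id = line[0:10]            # assumed: ID in columns 0-9
    amount = line[40:52]           # assumed: amount in columns 40-51
    date = line[60:68]             # assumed: YYYYMMDD date in columns 60-67
    if date >= "20020101":         # example filter: recent records only
        ofile.write(rec_id + amount + date + "\n")
ifile.close()
ofile.close()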

JM-> *** IMPORTANT ***
(1) How many rows/records in each file??? How much relevant info in
each record???
(2) This exercise is "doable" in a reasonable run-time IF a Python
dictionary of the keys of the records in "no_dup_sub_file1" can be
held in real (non-swapped) memory -- see the sketch after this list.
(3) What is the definition of "duplicate IDs"??? Does it involve
"fuzzy matching" as in deduplication of mailing lists?

[Aahz]
I strongly suggest that you push *VERY* *HARD* to get some small
sample
files (100MB to 1GB range).  Get those samples ASAP.

JM-> Even 10 MB might do if you get every nth record. Don't believe
any specification you are given. To quote my son on arrival back from
a database integration job that took 10 days instead of 3: "Users are
bastards. They lie to you".

[Aahz]
Finally, make sure that you have at least four times as much disk space
as the total size of all the files (that's probably a conservative guess,
but you'll definitely need at least 2.5 times).

JM-> Maybe not so much if you do things sequentially. However it's
still a whole lot more disk space than is usually associated with a
non-server Windows box. Interesting question: on what medium is the
80GB arriving?

[holger krekel (pyth at devel.trillke.net)]

I disagree. Don't be so humble :-)
Using a database requires

- setting up/configuring an appropriate database
  for the task (may not be easy)

- getting the info from the files into a database
  (requires reading the file anyway!)

- reading from the database with partial result sets
  and converting again to a file.

I think this is error-prone and quite complex.

JM-> If Rad is new to programming he might just be able to do this in
Python if the job is a one-off and is as simple as it seems to be.
However Murphy's Law will operate. Learning SQL and configuring a
database is certainly not newbie territory.

[holger]
why not a more pythonic way like this:

- use module mmap [snip]

JM-> I don't see what this gives you, except complication. Remember Rad
is a newbie.

[Holger]
it is important to know what your time constraints are.
10 minutes, an hour, a night?

JM-> I'd multiply these numbers by 10, at least, for a one-off or
first-time effort.

[David Bolen (db3l at fitlinxx.com) gives *lots* of good advice]

JM-> However installing Cygwin would be yet another bump in the
learning curve for Rad the newbie.

[Neal Norwitz (neal at metaslash.com) gives an example that upshifts a
650MB file]

JM-> I expect Rad will have to do some more complicated
transformations than upshifting. Even superficially simple things like
converting strings to internal integer or date format so that they can
be compared will add considerably to the run time.
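
For instance, filtering on a numeric field means one int() call per
record on top of the slicing (the layout below is invented):

# One conversion per record; over tens of millions of records the cost
# of int() alone becomes noticeable.
ifile = open("file1.txt", "r")
ofile = open("filtered.txt", "w")
for line in ifile:
    balance = int(line[40:52])     # assumed: zero-padded amount in columns 40-51
    if balance > 100000:
        ofile.write(line)
ifile.close()
ofile.close()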

[Neal]
for x in ifile:
        ofile.write(x.upper())
JM->
# bind the method to a local name once, so each record
# avoids an attribute lookup on ofile
write_out = ofile.write
for x in ifile:
        write_out(x.upper())

[Fernando Pérez (fperez528 at yahoo.com)]

May I suggest that you also pick up some basic (4 hrs worth) grep/awk
knowledge?

JM-> Yes, these are basic programmer tools. You don't need to install
Cygwin. Just get the tools that you need from the GNUWin32 site.

[Bo (bosahvremove at netscape.net)]

If this is a recurring requirement at work I would consider hiring a
contractor to create an elegant solution with defined steps for you to
operate it.

JM-> Generally agree, but I'd s/elegant/robust/

Some further comments:

Does the data include dates? Do you need to select data by date
comparison?
If you are lucky, dates will be in YYYYMMDD or YYYY-MM-DD format and
you can use a string comparison. Unlucky: MM/DD/YYYY format -- you
will need to convert to a "comparable" form. Do you need to do any
date calculations? If so, grab a copy of the mxDateTime package; it's
implemented in C. Python-based date modules will be far too slow,
protestations from their authors notwithstanding :-)
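
If you do get stuck with MM/DD/YYYY and only need comparisons (not date
arithmetic), a simple rearrangement gives you a string-comparable key --
a sketch:

def comparable_date(mdy):
    # "04/24/2002" -> "20020424"; plain string comparison then orders correctly
    mm, dd, yyyy = mdy.split("/")
    return yyyy + mm + dd

assert comparable_date("12/31/2001") < comparable_date("04/24/2002")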

Write yourself an every-nth-record selector in Python. You will need
it to produce reduced datasets for eyeballing and testing. ***Reality
check***: If you find writing such a thing within 30 minutes to be a
struggle, activate Plan B [hire a professional] now.
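
A minimal sketch of such a selector (the file names and sampling interval
come from the command line; adjust to taste):

# every_nth.py -- copy every nth record of a big file to a small sample file
import sys

def sample(in_name, out_name, n):
    ifile = open(in_name, "r")
    ofile = open(out_name, "w")
    count = 0
    for line in ifile:
        if count % n == 0:
            ofile.write(line)
        count = count + 1
    ifile.close()
    ofile.close()

if __name__ == "__main__":
    # usage: python every_nth.py file1.txt sample1.txt 1000
    sample(sys.argv[1], sys.argv[2], int(sys.argv[3]))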

HTH,
John


