python and very large data sets???

Bengt Richter bokr at oz.net
Wed Apr 24 16:02:52 EDT 2002


On 24 Apr 2002 09:41:22 -0700, zaka07 at hotmail.com (Rad) wrote:

>I am preparing myself to work on extracting data from 4 text files
>(fixed width format) which combined size is about 80GB.  Considering
How wide is the fixed-width format? I.e., how long is each record, and how
many records are there? How much duplicate or
unused fluff is there in the records? How much will wind up in your output?
Is this a one-time thing?

>deadlines, costs, and my limited programming knowledge I thought using
>Python/Windows for the job would be the best option for me.  However,
>I am worried about the speed in which Python (me and my hardware) will
>be able to deal with these massive data sets but I am hoping that this
>is still a quicker route then learning C.
>I still haven't received above mentioned files so I can't test the
Are you going to receive a hard disk with the files on it, or tapes,
(or ~125 CDs ;-) ? Is the data pre-sorted in any way that might help you?
What OS are you running? How much disk space and RAM do you have? What CPU?
How many disks on your system? SCSI or IDE? If they give you an HD, will it
be compatible physically and file-system-wise? Will your power supply hack it?

>time needed to (for example) read a 15GB "file1", filter by few
>variables, and write a resulting subset as a "sub_file1".  Things
>would afterwards get more complicated cause I will have to pullout
>ID's from "sub_file1", remove duplicate ID's create
>"no_dup_sub_file1", match those to ID's in remaining 3 main files and
How many total unique IDs of what type/form? How much memory do you have or
can you expand to? E.g., if the IDs would not make too big a dictionary
(using all the same value of None for every key), then you might be able
to eliminate duplicates as they happen in one pass, and create "no_dup_sub_file1"
without first creating "sub_file1".
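
For what it's worth, here is a rough sketch of that one-pass idea, with a dict
of already-seen IDs (all mapped to None). The column slices and the filter test
are pure assumptions; substitute your real record layout:

# One-pass filter plus duplicate elimination (sketch only).
seen = {}
inf = open("file1.txt", "r")
out = open("no_dup_sub_file1.txt", "w")
for line in inf:
    rec_id = line[0:10].strip()    # assumed: ID in columns 0-9
    flag = line[10:12].strip()     # assumed: filter field in columns 10-11
    if flag != "XY":               # hypothetical filter condition
        continue
    if rec_id in seen:             # duplicate; already written once
        continue
    seen[rec_id] = None
    out.write(line)
inf.close()
out.close()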

>pullout data linked with those ID's.
>
Does this imply random access to data in those three files? Simultaneously?
I.e., you might want to make an index of ID/file positions. And/or you might want
to sort your access IDs by position so you can get everything in one pass. (OTOH,
what ordering is required in your output? What you do might depend on that.)

But whether you can do all this easily will depend on how much memory you have and
the number of IDs.
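
If it does come to random access, an index of ID to byte offset is cheap to
build when the records really are fixed length, since the offset is just
record_number * record_length. A sketch, with RECLEN and the ID columns made
up (reading in binary mode keeps the offsets honest, and means the keys are
raw byte strings):

# Build an in-memory index mapping ID -> byte offset, then fetch by seek().
RECLEN = 200                       # hypothetical record length, newline included
index = {}
f = open("file2.txt", "rb")
offset = 0
while 1:
    rec = f.read(RECLEN)
    if not rec:
        break
    index[rec[0:10].strip()] = offset   # assumed: ID in columns 0-9
    offset = offset + RECLEN

def get_record(rec_id):
    pos = index.get(rec_id)        # None if the ID isn't in this file
    if pos is None:
        return None
    f.seek(pos)
    return f.read(RECLEN)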

All this will be re-inventing database machinery, so you might want to consider
going that route, as others have suggested. A flat fixed-width source sounds like an
easy importing/loading problem. I haven't tried MySQL on Windows, but on Linux it was
not hard to get going, and it might be faster than a fully transactional DB like
PostgreSQL (I don't know of any recent benchmarks). I would guess that you could beat
the disk space requirements by writing a special-purpose program, though.
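
If you do go the database route, a few lines of Python will turn the
fixed-width source into something a bulk loader (e.g. MySQL's LOAD DATA
INFILE) will swallow. Again, the column positions below are placeholders:

# Slice fixed-width records into tab-delimited lines for bulk loading.
COLUMNS = [(0, 10), (10, 12), (12, 40)]   # hypothetical (start, end) pairs
inf = open("file1.txt", "r")
out = open("file1.tab", "w")
for line in inf:
    fields = [line[a:b].strip() for a, b in COLUMNS]
    out.write("\t".join(fields) + "\n")
inf.close()
out.close()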

>I have a few weeks to prepare myself before data arrives and my
>question is: am I going the right way about the project, is Python
>(with humanly written code) capable of doing this kind of staff
>relatively quickly?
>
It doesn't sound that complex from what you've described. Just big.
So it will be important to know how much state you have to accumulate
to do the processing you need, and whether that will fit in memory.
E.g., if you have a billion unique IDs, a naive dictionary approach
is not likely to fly for detecting duplicates.
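(Back of the envelope: a billion keys at even a few dozen bytes apiece of
string plus dictionary overhead is tens of gigabytes, far more RAM than you
are likely to have.)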

How you then go about getting end results efficiently will depend
on what the problem really is ;-)

Your best bet on getting help with that (as well as solving it yourself)
is to specify it as precisely and unambiguously as you can (while assiduously
relegating pre-conceived solutions to footnotes or appendices).

If you post a precise definition of the problem, someone might bite.
Also, if you have any options on the way the data is delivered, you might
want to get some opinions on that before it's too late.

Regards,
Bengt Richter
