python and very large data sets???

Terry Reedy tjreedy at udel.edu
Thu Apr 25 12:23:03 EDT 2002


"Rad" <zaka07 at hotmail.com> wrote in message
news:ad381f5b.0204250629.5144d196 at posting.google.com...
> Thanks for all the suggestions and ideas, I'll try to answer to all
> your questions in this one post.
...
> At this time I don't know for sure how many unique ID's are going to
> be, not less than 15 million I guess.

With 2 GB of RAM, you might then be able to fit an id/disk-address
index in memory.
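
A rough sketch of such an index (assuming newline-terminated records
in a hypothetical file "records.dat", with the id in the first 9
characters of each line; the file name and field width are made up):

def build_index(path, id_width=9):
    # Map each record id to the byte offset where its line starts.
    index = {}
    offset = 0
    with open(path, "rb") as f:
        for line in f:
            record_id = line[:id_width].decode("ascii").strip()
            index[record_id] = offset
            offset += len(line)
    return index

def fetch(path, index, record_id):
    # Seek straight to one record instead of rescanning the whole file.
    with open(path, "rb") as f:
        f.seek(index[record_id])
        return f.readline()

With 15 million ids the dictionary itself takes a fair amount of
memory, so try it on a subset before committing to this approach.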

> storing it on the HD unless DDS starts clicking

DDS?

If this is a one-time project, I am not sure of the rationale for
also using AWK, though the ideas underlying its operation may be useful.

Defragmenting the disk before loading the big files will reduce read
time.

Advice based on similar projects with much smaller files:

1. Proceed methodically in reasonably sized steps and KEEP A LOG of
what you do.  For each step, list the purpose, input file, test or
transformation script (.py or whatever), and output files (results or
data).

2. If you can write as well as read tapes, store intermediate versions
of the data files so you do not have to start over from the original
data if you discover a goof several steps along the way.

3. As for testing: I would check that the files really are in the
specified format before doing anything else: line lengths correct, ids
and dates valid, etc.  I would start with a small subset and get each
check working there before running it against the multi-gigabyte files.
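
A hedged sketch of such a format check (the 80-character line length,
the id in columns 0-8, and the YYYYMMDD date in columns 9-16 are
invented for illustration; adjust them to the real record layout):

import datetime

def check_file(path, expected_len=80):
    # Report lines whose length, id, or date field looks wrong.
    bad = 0
    with open(path, "rb") as f:
        for lineno, raw in enumerate(f, 1):
            line = raw.rstrip(b"\r\n").decode("ascii", "replace")
            try:
                if len(line) != expected_len:
                    raise ValueError("length %d, expected %d"
                                     % (len(line), expected_len))
                int(line[0:9])                                    # id must be numeric
                datetime.datetime.strptime(line[9:17], "%Y%m%d")  # date must parse
            except ValueError as err:
                bad += 1
                print("line %d: %s" % (lineno, err))
    print("%d suspect lines in %s" % (bad, path))

Run it first on the small subset, e.g. check_file("subset.dat"),
before pointing it at the full files.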

Terry J. Reedy