python and very large data sets???

Rad zaka07 at hotmail.com
Thu Apr 25 10:29:27 EDT 2002


Thanks for all the suggestions and ideas; I'll try to answer all your
questions in this one post.

It is a longish-term project, but after a few initial data extractions
I'll have some time to re-implement the procedures if there is a need
to do so.
At this time I don't know for sure how many unique IDs there are going
to be; not less than 15 million, I guess.
I don't have any options on the way the data is delivered: it's going
to come on tape(s), and all relevant files will be in fixed-width
format.  I am OK with that and shouldn't have any problems reading it
in and storing it on the HD, unless the DDS drive starts clicking :(.
The machine that will run Win2k has four 73 GB SCSI disks, 2 GB of
RAM, dual Xeon processors and three power supplies.
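
Just to make the fixed-width part concrete, here is roughly how I
picture pulling fields out of each record by column offsets (the
layout below is invented; the real offsets will come with the tape
documentation):

    # Hypothetical column layout for one of the four files:
    # chars 0-7 = ID, 8-15 = YYYYMMDD date, 16-25 = amount.
    FIELDS = {"uid": (0, 8), "date": (8, 16), "amount": (16, 26)}

    def parse_line(line):
        rec = {}
        for name, (lo, hi) in FIELDS.items():
            rec[name] = line[lo:hi].strip()
        return rec

    with open("file1.dat") as f:
        for line in f:
            rec = parse_line(line)
            # rec["uid"], rec["date"], rec["amount"] are all strings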

I agree that the actual work is not that complex, even though there
are quite a few other bits involved in the project that I didn't think
were relevant to mention in my earlier post.  I also agree that it
could be done in a relational database if the files were smaller, for
example in Access (which I know quite well).  I know some SQL too, but
I have never used MySQL, which is perhaps a possible route to take.
However, as I just mentioned, I don't know anything about MySQL
(except that one has to pay for it), and having been learning Python
over the last few weeks I kind of like it.  I was also encouraged by
your posts, and I'm now pretty convinced that Python 2.2 is the way to
go.  I won't have to access the data randomly, just pull out the
records fitting certain criteria from the original four files and
create the resulting files.  I might decide (depending on their size)
to put those into a database, though.
One thing I still need to figure out is how to interface Python to a
database.
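
From the little reading I've done so far, it seems to go through
Python's DB-API: connect, get a cursor, execute parameterized SQL.
Here is a minimal sketch of that pattern, using the sqlite3 module
that ships with current Python as a stand-in (the table layout is
invented; a MySQL driver such as MySQLdb uses the same cursor/execute
calls):

    import sqlite3  # any DB-API driver follows the same pattern

    conn = sqlite3.connect("extracts.db")
    cur = conn.cursor()
    cur.execute("CREATE TABLE IF NOT EXISTS records"
                " (uid TEXT, obs_date TEXT, value TEXT)")

    # in reality the rows would come from parsing the fixed-width files
    rows = [("00000001", "20020425", "A"),
            ("00000002", "20011231", "B")]
    cur.executemany("INSERT INTO records VALUES (?, ?, ?)", rows)
    conn.commit()

    # pull out the records fitting certain criteria
    cur.execute("SELECT uid, value FROM records WHERE obs_date >= ?",
                ("20020101",))
    for uid, value in cur.fetchall():
        print(uid, value)
    conn.close()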

All four files have dates in them, but they come in YYYYMMDD format,
so I was planning to use string comparisons; zero-padded YYYYMMDD
strings sort lexicographically in the same order as the dates
themselves.  The same goes for the rest of the data: I was thinking of
treating it all as strings.
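
As a quick sanity check of the string-comparison idea (the cutoff
dates below are made up):

    START, END = "20010701", "20020630"  # hypothetical extraction window

    dates = ["20020425", "19991231", "20020101"]
    in_range = [d for d in dates if START <= d <= END]
    print(in_range)  # ['20020425', '20020101']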

My initial plan is either to split the files into smaller files that
can fit into memory, or to read chunks of the original big file with
file_object.readlines(sizehint), using a size hint that fits in
memory.
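
Roughly like this (the chunk size is a guess, and keep() stands in for
the real filter criteria):

    CHUNK_HINT = 64 * 1024 * 1024  # read roughly 64 MB of lines per batch

    def keep(line):
        # placeholder filter: hypothetical date column in chars 8-15
        return line[8:16] >= "20010701"

    with open("big_input.dat") as src, open("extract.dat", "w") as dst:
        while True:
            batch = src.readlines(CHUNK_HINT)  # stops once hint is exceeded
            if not batch:
                break
            dst.writelines(line for line in batch if keep(line))

Simply looping over the file object line by line is also
memory-friendly, since Python buffers the reads, so splitting the
files may not even be necessary.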

I'm stuck with it (the project) either way, and I'm looking forward to
the challenge.  Hope I don't regret saying this publicly!

I believe that the initial filtering will reduce the file size by at
least 50%, so I'll do some research on awk for Windows; I guess there
is info about it on the Internet.

Thanks again for all your suggestions and ideas.
Probably "talk" to you soon.


