python and very large data sets???

John Purser NO_SPAM_jmpurser2 at attbi.com
Thu Apr 25 11:28:03 EDT 2002


I did a not dissimilar project recently, and like you I didn't feel I had
time to stop and learn how to interface with MySQL before I started.  In
my case I had a 22 MB binary data file and an ASCII file with 156,000
records of edit information for the data in the binary file.  My first
"brute force" method looked like it was going to take 80 hours to finish!
I tinkered around with some indexing schemes and then looked into the
anydbm module.  What I wound up doing was writing the ASCII records into
an anydbm file keyed by ID on the first pass, then running through the
binary file and using each record's key to look up its edit material in
the anydbm file.  Much faster than brute force and more reliable than my
index routines.
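
In case it's useful, here's a rough sketch of that two-pass approach.  The
file names, record length, and field positions below are all made up; the
point is just that anydbm gives you a dictionary-like file of string keys
and string values:

import anydbm

# Pass 1: load the ASCII edit records into a dbm file keyed by ID.
edits = anydbm.open('edits.db', 'n')
f = open('edits.txt', 'r')
while 1:
    line = f.readline()
    if not line:
        break
    key = line[0:10].strip()          # ID field (made-up position)
    edits[key] = line[10:].rstrip()   # edit material for that ID
f.close()
edits.close()

# Pass 2: walk the binary file record by record and look up each key.
RECLEN = 64                           # made-up fixed record length
edits = anydbm.open('edits.db', 'r')
data = open('data.bin', 'rb')
out = open('result.out', 'w')
while 1:
    rec = data.read(RECLEN)
    if not rec:
        break
    key = rec[0:10].strip()
    if edits.has_key(key):
        out.write(key + ' ' + edits[key] + '\n')
out.close()
data.close()
edits.close()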

Now that the customer is happy with the solution and the time pressure is
off, I'm looking into using a relational database for projects like these,
but in a pinch the anydbm module can get you up and running quickly and
for free.

Have fun.

John Purser

"Rad" <zaka07 at hotmail.com> wrote in message
news:ad381f5b.0204250629.5144d196 at posting.google.com...
> Thanks for all the suggestions and ideas; I'll try to answer all your
> questions in this one post.
>
> It is a longish-term project, but after a few initial data extractions
> I'll have some time to re-implement the procedures if there is a need
> to do so.
> At this time I don't know for sure how many unique IDs there are going
> to be; not less than 15 million, I guess.
> I don't have any options on the way the data is delivered; it's going
> to come on tape(s) and all relevant files will be in fixed-width
> format.  I am OK with that, and shouldn't have any problems reading it
> in and storing it on the HD unless the DDS drive starts clicking :(.
> The computer that will run win2k has four 73 GB SCSI disks, 2 GB of
> RAM, dual Xeon processors and three power supplies.
>
> I agree that the actual work is not that complex, even though there
> are quite a few other bits involved in the project that I thought were
> not relevant to mention in my earlier post.  I also agree that it
> could be done in a relational database, for example Access (which I
> know quite well) if the files were smaller.  I also know some SQL, but
> I have never used MySQL, which is perhaps a possible route to take.
> However, as I just mentioned, I don't know anything about MySQL
> (except that one has to pay for it), plus having been learning Python
> for the last few weeks I kind of like it, plus I was encouraged by
> your posts, so I'm now pretty convinced that Python 2.2 is the way to
> go.  I won't have to access data randomly, just pull out the data
> fitting certain criteria from the original 4 files and create the
> resulting files.  I might decide (depending on the size) to put those
> into a database though.
> One thing I don't understand is how I can interface Python with a
> database.
>
> All four files have dates in them, but they do come in YYYYMMDD format
> and I was planning to use string comparisons; same for the rest of the
> data, I was thinking to treat it all as strings.
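
String comparisons will work fine for YYYYMMDD dates, by the way: the
strings sort the same way the dates themselves do.  Something along these
lines (the column positions are invented) is all a range test needs:

# YYYYMMDD strings compare the same way the dates themselves do,
# so a plain string comparison is enough for a date range test.
start, end = '19990101', '20011231'

def in_range(record):
    # Assumes the date sits in columns 20-28 of the fixed-width record
    # (invented positions -- adjust to the real layout).
    return start <= record[20:28] <= end
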
>
> My initial plan is to either split the files into smaller files that
> could fit into memory, or read chunks of data from the original big
> file using file_object.readlines(some_size_that_can_fit_in_the_memory).
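
readlines() with a sizehint works nicely for that; a loop along these
lines keeps memory use flat.  keep() below is just a placeholder for
whatever selection test you end up using:

def keep(line):
    # Placeholder selection test -- replace with the real criteria,
    # e.g. keep only records on or after a given YYYYMMDD date.
    return line[20:28] >= '19990101'

# Process the big file roughly a megabyte of lines at a time.
f = open('bigfile.txt', 'r')
out = open('filtered.txt', 'w')
while 1:
    lines = f.readlines(1024 * 1024)   # sizehint in bytes, approximate
    if not lines:
        break
    for line in lines:
        if keep(line):
            out.write(line)
out.close()
f.close()
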
>
> I'm stuck with it (the project) and I'm looking forward to the
> challenge.
> Hope I don't regret saying this publicly!?
>
> I believe that initial filtering will reduce the file size by at least
> 50%, so I'll do some research on awk (for windows).  I guess there is
> info about it on the Internet.
>
> Thanks again for all your suggestions and ideas.
> Probably "talk" to you soon.




