Transforming ASCII file (pseudo database) into proper database

Paul Rubin
Mon Jan 21 18:08:30 EST 2008


"p." <ppetrick at gmail.com> writes:
> So as an exercise, let's assume an 800MB file, each line of data taking up
> roughly 150B (guesstimate - based on examination of sample data)...so
> roughly 5.3 million unique IDs.

I still don't understand what the problem is.  Are you familiar with
the concept of external sorting?  What OS are you using?  If you're
using a Un*x-like system, the built-in sort command should do what you
need.  "Internal" sorting means reading a file into memory and sorting
it in memory with something like the .sort() function.  External
sorting is what you do when the file won't fit in memory.  Basically
you read sequential chunks of the file where each chunk fits in
memory, sort each chunk internally and write it to a temporary disk
file, then merge all the disk files.  You can sort inputs of basically
unlimited size this way.  The unix sort command knows how to do this.
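For concreteness, here's a minimal sketch of that chunk-and-merge
scheme in Python.  The function name, chunk size, and paths are
placeholders I made up for illustration, not anything from your data:

    import heapq
    import itertools
    import os
    import tempfile

    def external_sort(in_path, out_path, chunk_lines=1000000):
        # Phase 1: read chunks that fit in memory, sort each one
        # internally, and spill it to a temporary file on disk.
        run_paths = []
        with open(in_path) as infile:
            while True:
                chunk = list(itertools.islice(infile, chunk_lines))
                if not chunk:
                    break
                chunk.sort()
                fd, path = tempfile.mkstemp(text=True)
                with os.fdopen(fd, "w") as run:
                    run.writelines(chunk)
                run_paths.append(path)

        # Phase 2: merge all the sorted runs into the output file.
        # heapq.merge is lazy, so it keeps only one line per run in
        # memory at any moment.
        runs = [open(p) for p in run_paths]
        try:
            with open(out_path, "w") as outfile:
                outfile.writelines(heapq.merge(*runs))
        finally:
            for f in runs:
                f.close()
            for p in run_paths:
                os.remove(p)

The unix sort command does all of this for you, temporary files and
merge included; the sketch is just to show why the input never has to
fit in memory.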

It's often a good exercise with this type of problem to ask yourself
how an old-time mainframe programmer would have done it.  A "big"
computer of the 1960's might have had 128 kbytes of memory and a few
MB of disk, but a bunch of magtape drives that held a few dozen MB
each.  With computers like that, they managed to process the phone
bills for millions of people.  The methods that they used are still
relevant with today's much bigger and faster computers.

If you watch old movies that tried to get a high tech look by showing
computer machine rooms with pulsating tape drives, external sorting is
what those computers spent most of their time doing.

Finally, 800MB isn't all that big a file by today's standards.  Memory
for desktop computers costs around 25 dollars per gigabyte, so having
8GB of RAM (around 200 dollars' worth) on your desk to crunch those
800MB files is not at all unreasonable.


