python and very large data sets???
holger krekel
pyth at devel.trillke.net
Thu Apr 25 11:13:05 EDT 2002
On Thu, Apr 25, 2002 at 07:29:27AM -0700, Rad wrote:
> Thanks for all the suggestions and ideas, I'll try to answer to all
> your questions in this one post.
>
> It is a longish term project but after a few initial data extractions
> I'll have some time to re-implement the procedures if there is a need
> to do so.
good :-)
> At this time I don't know for sure how many unique ID's are going to
> be, not less than 15 million I guess.
How long are the IDs supposed to be? Variable length or fixed length?
> Don't have any options on the way data is delivered, it's going to
> come on tape/s and all relevant files will be in fixed width format.
> I am OK with that, shouldn't have any problems reading it in and
> storing it on the HD unless DDS starts clicking :(. I have four 73 GB
> SCSI disks on the computer that will run win2k, 2 GB of RAM, dual xeon
> processor and three power units.
The power units are extremely important :-)
> However, as I just mentioned I don't know anything about MySQL (except that one
> has to pay for it)
Where did you hear that? It's wrong: MySQL is available under an
open/free license (the GPL).
> having been learning Python in the last few weeks
> I kind of like it + I was encouraged by your posts and I'm now pretty
> convinced that Python2.2 is the way to go.
Be careful. Some people pointed out that in the end it might
be a task which a database can handle more reliably.
> I won't have to access data
> randomly, just pull out the data fitting certain criteria from
> original 4 files and create resulting files. I might decide
> (depending on the size) to put those into a database though.
> I don't understand how I can interface Python to a database?
it's not hard. Try looking for 'mysqldb python' or so on google.
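
A minimal sketch of what talking to a database from Python looks like. It
assumes the standard DB-API pattern (connect, cursor, execute, fetch), which
MySQLdb also follows. The sqlite3 module is swapped in here only so the
example runs without a MySQL server; the table name, column names and sample
rows are made up for illustration.

```python
import sqlite3

# sqlite3 stands in for a MySQL server here (an assumption for the sketch).
# MySQLdb has the same connect -> cursor -> execute -> fetch shape, but uses
# %s placeholders instead of ?.
def load_and_query(rows, cutoff):
    """Store (id, day) rows and return the ids dated on or after cutoff."""
    conn = sqlite3.connect(":memory:")
    cur = conn.cursor()
    cur.execute("CREATE TABLE records (id TEXT, day TEXT)")
    cur.executemany("INSERT INTO records (id, day) VALUES (?, ?)", rows)
    conn.commit()
    cur.execute("SELECT id FROM records WHERE day >= ? ORDER BY id", (cutoff,))
    result = [row[0] for row in cur.fetchall()]
    conn.close()
    return result

# Made-up sample data: (id, YYYYMMDD date).
sample = [("A001", "20020101"), ("A002", "20020425"), ("A003", "20011231")]
print(load_and_query(sample, "20020101"))
```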
> All four files have dates in them but they do come in YYYYMMDD format
> and I was planning to use string comparisons, same for the rest of
> data, I was thinking to treat it all as strings.
Huh? YYYYMMDD is easily convertible to a 32-bit value. See "help('time')"
at the Python prompt.
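
A small sketch of two ways to do that conversion, assuming the date field
has already been sliced out of the record as an 8-character string. The
function names are made up for illustration. A plain int() keeps the
chronological order and fits comfortably in 32 bits; a real date object is
only needed for calendar arithmetic or validation.

```python
import datetime

def yyyymmdd_to_int(field):
    """'20020425' -> 20020425; integer order matches chronological order."""
    return int(field)

def yyyymmdd_to_date(field):
    """Parse into a date object; raises ValueError for impossible dates."""
    return datetime.datetime.strptime(field, "%Y%m%d").date()

a = yyyymmdd_to_int("20020425")
b = yyyymmdd_to_int("20011231")
print(a > b)

# Calendar arithmetic needs real dates, not plain integers.
delta = yyyymmdd_to_date("20020425") - yyyymmdd_to_date("20020101")
print(delta.days)
```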
> My initial plan is to either split files into smaller files that could
> fit into the memory or read chunks of data from original big file
> using file_object.readlines(some_size_that_can_fit_in_the_memory).
Reading the file isn't the problem. See my code example in a previous
posting to see how you can read 'chunkwise'.
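
The earlier posting is not reproduced here. As a rough sketch of chunkwise
reading (not necessarily the code referred to above), readlines() accepts a
size hint and returns only complete lines, so a loop like the following
never holds more than about one chunk in memory. The helper name, the
sample file and the chunk size are assumptions for illustration.

```python
import os
import tempfile

def iter_line_chunks(path, size_hint=1 << 20):
    """Yield lists of complete lines, each list roughly size_hint in size."""
    with open(path, "r") as f:
        while True:
            lines = f.readlines(size_hint)
            if not lines:       # empty list means end of file
                break
            yield lines

# Demonstration on a small made-up file with a tiny chunk size.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.txt")
    with open(path, "w") as f:
        for i in range(1000):
            f.write("%010d\n" % i)
    chunks = list(iter_line_chunks(path, size_hint=256))

total = sum(len(chunk) for chunk in chunks)
print(len(chunks), total)
```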
> I'm stuck with it (the project) and I'm looking forward to the
> challenge.
You should really start to generate some example files (made up
by you) and start to code *right now*. Don't believe these
academics who tell you that you absolutely need a four-year-plan
or a thesis on 'performance problems of today's computer systems'.
(although the latter might help :-)
Even if you later decide to use a mysql-db you still have to apply
the same basic techniques of reading/writing the files, extracting
ids and writing stuff. And you have to understand what exactly
you want to do no matter if it is python or python+SQL or whatever.
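
To make that concrete, here is a sketch that writes a made-up fixed-width
file and then pulls out the records matching a date criterion. The record
layout (a 10-character ID followed by an 8-character YYYYMMDD date), the
file names and the date range are all assumptions; the real layout will
differ.

```python
import os
import random
import tempfile

ID_WIDTH, DATE_WIDTH = 10, 8    # hypothetical fixed-width layout

def write_sample(path, n, seed=0):
    """Write n made-up records: zero-padded ID followed by a 2002 date."""
    rng = random.Random(seed)
    with open(path, "w") as f:
        for i in range(n):
            day = "2002%02d%02d" % (rng.randint(1, 12), rng.randint(1, 28))
            f.write("%0*d%s\n" % (ID_WIDTH, i, day))

def extract(src, dst, first_day, last_day):
    """Copy records whose date lies in [first_day, last_day]; return count."""
    kept = 0
    with open(src) as fin, open(dst, "w") as fout:
        for line in fin:        # one record in memory at a time
            day = int(line[ID_WIDTH:ID_WIDTH + DATE_WIDTH])
            if first_day <= day <= last_day:
                fout.write(line)
                kept += 1
    return kept

with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "made_up.txt")
    dst = os.path.join(tmp, "april.txt")
    write_sample(src, 1000)
    kept = extract(src, dst, 20020401, 20020430)
    with open(dst) as f:
        out_lines = f.readlines()

print(kept, len(out_lines))
```

The same loop works unchanged on a file far larger than memory, since it
only ever holds one line at a time.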
have fun,
holger