python and very large data sets???

Thu Apr 25 11:13:05 EDT 2002

On Thu, Apr 25, 2002 at 07:29:27AM -0700, Rad wrote:
> Thanks for all the suggestions and ideas, I'll try to answer to all
> your questions in this one post.
> 
> It is a longish term project but after a few initial data extractions
> I'll have some time to re-implement the procedures if there is a need
> to do so.

good :-)

> At this time I don't know for sure how many unique ID's are going to
> be, not less than 15 million I guess.

How long are the IDs supposed to be? Variable length or fixed length?

> Don't have any options on the way data is delivered, it's going to
> come on tape/s and all relevant files will be in fixed width format. 
> I am OK with that, shouldn't have any problems reading it in and
> storing it on the HD unless DDS starts clicking :(.  I have four 73 GB
> SCSI disks on the computer that will run win2k, 2 GB of RAM, dual xeon
> processor and three power units.

The power units are extremely important :-)

> However, as I just mentioned I don't know anything about MySQL (except that one
> has to pay for it)

Where did you hear that? It's wrong: MySQL is available under an
open/free license.

> having been learning Python in the last few weeks
> I kind of like it + I was encouraged by your posts and I'm now pretty
> convinced that Python2.2 is the way to go. 

Be careful. Some people pointed out that in the end it might
be a task which a database can handle more reliably. 

> I won't have to access data
> randomly, just pull out the data fitting certain criteria from
> original 4 files and create resulting files.  I might decide
> (depending on the size) to put those into a database though.
> I don't understand how I can interface Python to a database?  

It's not hard. Try searching Google for 'mysqldb python' or similar.
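
To give a rough idea of what talking to a database from Python looks
like, here is a minimal sketch. It uses the sqlite3 module purely as a
stand-in so that it runs without a server; it is not MySQL. MySQLdb
follows the same DB-API pattern (connect, cursor, execute, fetch), but
its placeholder style is %s rather than ?. The table name, column names
and sample rows are made up.

```python
import sqlite3

# In-memory database as a stand-in; with MySQLdb you would call
# MySQLdb.connect(...) with host/user/password arguments instead.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Hypothetical table: one row per extracted record.
cur.execute("CREATE TABLE records (id TEXT, event_date TEXT)")

rows = [
    ("A000001", "20020425"),
    ("A000002", "20011231"),
    ("A000003", "20020101"),
]
cur.executemany("INSERT INTO records VALUES (?, ?)", rows)
conn.commit()

# Pull out the IDs whose date is on or after 1 January 2002.
cur.execute(
    "SELECT id FROM records WHERE event_date >= ? ORDER BY id",
    ("20020101",),
)
matching_ids = [row[0] for row in cur.fetchall()]
print(matching_ids)
conn.close()
```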

> All four files have dates in them but they do come in YYYYMMDD format
> and I was planning to use string comparisons, same for the rest of
> data, I was thinking to treat it all as strings.

Huh? YYYYMMDD is easily convertible to a 32-bit value. See "help('time')"
at the Python prompt.
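
A minimal sketch of such a conversion, assuming every date field is
exactly eight digits. The function name date_to_int is made up for
illustration. The largest possible result, 99991231, fits comfortably
in a signed 32-bit integer.

```python
import time

def date_to_int(yyyymmdd):
    """Turn a 'YYYYMMDD' string into an integer such as 20020425."""
    # strptime alone is lenient about missing digits, so check the
    # shape of the field first.
    if len(yyyymmdd) != 8 or not yyyymmdd.isdigit():
        raise ValueError("expected 8 digits, got %r" % yyyymmdd)
    # Raises ValueError for impossible dates such as month 13.
    time.strptime(yyyymmdd, "%Y%m%d")
    return int(yyyymmdd)

a = date_to_int("20020425")
b = date_to_int("20011231")
print(a > b)  # integer comparison follows calendar order
```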

> My initial plan is to either split files into smaller files that could
> fit into the memory or read chunks of data from original big file
> using file_object.readlines(some_size_that_can_fit_in_the_memory).

Reading the file isn't the problem. See my code example in a previous
posting to see how you can read 'chunkwise'.
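
For readers without that posting at hand, here is one possible way to
read chunkwise. It is a sketch, not the code from the earlier posting.
It relies on readlines() accepting a size hint and returning only whole
lines. The helper name read_in_chunks, the sample file and the sizes
are all assumptions.

```python
import os
import tempfile

def read_in_chunks(path, size_hint=1024 * 1024):
    """Yield lists of complete lines, each list roughly size_hint in size."""
    with open(path, "r") as f:
        while True:
            lines = f.readlines(size_hint)
            if not lines:
                break  # end of file
            yield lines

# Small made-up sample file so the sketch can be run as-is.
fd, sample_path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "w") as f:
    for i in range(1000):
        f.write("%07d20020425\n" % i)

total = 0
for chunk in read_in_chunks(sample_path, size_hint=2000):
    total += len(chunk)  # real processing of the chunk would go here
print(total)
os.remove(sample_path)
```

Only one chunk is held in memory at a time, so the size of the input
file does not matter much.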

> I'm stuck with it (the project) and I'm looking forward to the
> challenge.

You should really start to generate some example files (made up
by you) and start to code *right now*. Don't believe these
academics who tell you that you absolutely need a four-year plan
or a thesis on 'performance problems of today's computer systems'
(although the latter might help :-).
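
Making up test data can be as simple as the following sketch. The
record layout (10-character ID, 8-character date, 8-character amount)
is purely an assumption, since the real layout isn't known yet, and the
name make_sample_lines is made up as well.

```python
import random

def make_sample_lines(count, seed=0):
    """Return `count` made-up fixed-width records, one per line."""
    rng = random.Random(seed)  # fixed seed gives reproducible test data
    lines = []
    for _ in range(count):
        # Small ID range on purpose, so IDs repeat as they would in real data.
        record_id = "%010d" % rng.randrange(10**6)
        date = "%04d%02d%02d" % (
            rng.randint(1995, 2002),
            rng.randint(1, 12),
            rng.randint(1, 28),  # stay below 29 to avoid invalid dates
        )
        amount = "%8d" % rng.randint(0, 99999)
        lines.append(record_id + date + amount + "\n")
    return lines

sample = make_sample_lines(5)
for line in sample:
    print(repr(line))
```

Passing the result to file_object.writelines() gives a test file of any
size you like.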

Even if you later decide to use a mysql-db you still have to apply 
the same basic techniques of reading/writing the files, extracting 
ids and writing stuff. And you have to understand what exactly 
you want to do no matter if it is python or python+SQL or whatever.
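
The basic technique could look roughly like this. It is a sketch under
assumptions: the column offsets, the criterion (date on or after a
cutoff) and the name extract_ids are all hypothetical. Keeping the seen
IDs in an in-memory set may or may not be workable for 15 million IDs,
depending on their length.

```python
import io

# Hypothetical layout: (start, end) column offsets within each line.
ID_FIELD = (0, 10)
DATE_FIELD = (10, 18)

def extract_ids(infile, outfile, min_date):
    """Write each unique ID whose date is >= min_date; return the count."""
    seen = set()
    for line in infile:
        date = int(line[DATE_FIELD[0]:DATE_FIELD[1]])
        if date >= min_date:
            record_id = line[ID_FIELD[0]:ID_FIELD[1]]
            if record_id not in seen:
                seen.add(record_id)
                outfile.write(record_id + "\n")
    return len(seen)

# Made-up input; io.StringIO stands in for real files on disk.
source = io.StringIO(
    "000000000120020425\n"
    "000000000220011231\n"
    "000000000120020501\n"
    "000000000320020101\n"
)
result = io.StringIO()
count = extract_ids(source, result, 20020101)
print(count)
print(result.getvalue())
```

The same loop works unchanged on real file objects, and the write call
is the only part that would change if the results went into a database
instead.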

have fun,

	holger