python and very large data sets???

Fernando Pérez fperez528 at yahoo.com
Wed Apr 24 13:31:41 EDT 2002


Rad wrote:

> I am preparing myself to work on extracting data from 4 text files
> (fixed-width format) whose combined size is about 80GB.  Considering
> deadlines, costs, and my limited programming knowledge, I thought using
> Python/Windows for the job would be the best option for me.  However,
> I am worried about the speed with which Python (and I, and my hardware)
> will be able to deal with these massive data sets, but I am hoping that
> this is still a quicker route than learning C.
> I still haven't received the above-mentioned files, so I can't test the
> time needed to (for example) read a 15GB "file1", filter by a few
> variables, and write the resulting subset as "sub_file1".  Things
> would afterwards get more complicated, because I will have to pull
> IDs out of "sub_file1", remove duplicate IDs to create
> "no_dup_sub_file1", match those to the IDs in the remaining 3 main
> files, and pull out the data linked with those IDs.
> 
> I have a few weeks to prepare myself before the data arrives, and my
> question is: am I going about the project the right way? Is Python
> (with humanly written code) capable of doing this kind of stuff
> relatively quickly?
> 
> Any help and suggestions would be greatly appreciated.

May I suggest that you also pick up some basic (4 hrs' worth) grep/awk 
knowledge? Sometimes I find that for quickly extracting a few tagged data 
fields from a large file, awk is much faster than Python. I'll make the awk 
call from my Python code and then do the more complex processing in 
Python.
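
As a rough illustration of what I mean (the filename and the field 
offsets are made up, since you haven't seen the files yet, and on 
Windows you'd need awk installed, e.g. via Cygwin):

import subprocess

# Let awk do the cheap fixed-width field extraction; the offsets
# below are invented -- adjust them to the real record layout.
awk_prog = '{ print substr($0, 1, 10) "\t" substr($0, 25, 8) }'
proc = subprocess.Popen(['awk', awk_prog, 'file1.txt'],
                        stdout=subprocess.PIPE, text=True)
for line in proc.stdout:
    rec_id, value = line.rstrip('\n').split('\t')
    # ... the more complicated logic stays here, in Python ...
proc.wait()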

I only use awk for _very simple_ pattern-based or column-based extractions, 
but for those tasks it's super easy and quite fast (it's all C). Anything 
even minimally complicated I leave to the actual Python code.

Just an idea. On a big problem, a single-tool mindset can be a bit 
constricting; give yourself some mental room.
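
For the Python-only steps of your problem, the main trick is to stream 
the file one line at a time instead of reading it whole. Here's a minimal 
sketch of the filter-plus-dedup pass (again, the offsets and the filter 
condition are invented, and it assumes the set of unique IDs fits in 
memory):

# Stream the big file one record at a time, keep only the matching
# rows, and drop duplicate IDs along the way.  Memory use stays flat
# no matter how large the input file is.
seen_ids = set()
with open('file1.txt') as src, open('no_dup_sub_file1.txt', 'w') as dst:
    for line in src:
        rec_id = line[0:10].strip()   # hypothetical ID column
        flag = line[10:12]            # hypothetical filter column
        if flag == 'OK' and rec_id not in seen_ids:
            seen_ids.add(rec_id)
            dst.write(line)

The same seen_ids set then makes matching against your other three 
files a simple membership test per line.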

good luck,

f.


