python and very large data sets???
Fernando Pérez
fperez528 at yahoo.com
Wed Apr 24 13:31:41 EDT 2002
Rad wrote:
> I am preparing myself to work on extracting data from 4 text files
> (fixed width format) whose combined size is about 80GB. Considering
> deadlines, costs, and my limited programming knowledge, I thought
> using Python/Windows for the job would be the best option for me.
> However, I am worried about the speed with which Python (me and my
> hardware) will be able to deal with these massive data sets, but I
> am hoping that this is still a quicker route than learning C.
> I still haven't received the above-mentioned files, so I can't test
> the time needed to (for example) read a 15GB "file1", filter by a few
> variables, and write the resulting subset as "sub_file1". Things
> would afterwards get more complicated because I will have to pull out
> IDs from "sub_file1", remove duplicate IDs to create
> "no_dup_sub_file1", match those to the IDs in the remaining 3 main
> files, and pull out the data linked with those IDs.
>
> I have a few weeks to prepare myself before the data arrives, and my
> question is: am I going about the project the right way? Is Python
> (with humanly written code) capable of doing this kind of stuff
> relatively quickly?
>
> Any help and suggestions would be greatly appreciated.
May I suggest that you also pick up some basic (4 hrs worth) grep/awk
knowledge? Sometimes I find that for quickly extracting a few tagged data
fields from a large file, awk is much faster than python. I'll make the awk
call from my python code and then carry on with the more complex stuff in
python.
I only use awk for _very simple_ pattern-based or column-based extractions,
but for those tasks it's super-easy and it's quite fast (all C). Anything
even minimally complicated I leave to the actual python code.
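Driving awk from Python for that kind of simple column extraction might look something like the sketch below. The sample data, field positions, and threshold are all invented, and it assumes an `awk` binary on the PATH; it uses Python's subprocess module to run the scan and hand the matches back to Python.

```python
import os
import subprocess
import tempfile

# Made-up sample data: three space-separated columns (ID, tag, size).
sample = "a1 foo 500\na2 bar 1500\na3 baz 2000\n"
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as f:
    f.write(sample)
    path = f.name

# Let awk do the fast scan: print column 1 wherever column 3 > 1000,
# then pick the result up in Python for the complicated logic.
result = subprocess.run(
    ["awk", "$3 > 1000 { print $1 }", path],
    capture_output=True, text=True, check=True,
)
ids = result.stdout.split()
os.unlink(path)
print(ids)  # → ['a2', 'a3']
```

The awk program stays a one-liner; everything past the raw extraction lives in Python, which is exactly the division of labor I'm suggesting.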
Just an idea. On a big problem, a single-tool mindset may be a bit
constricting; give yourself some mental room.
good luck,
f.