python and very large data sets???

Bo bosahvremove at netscape.net
Wed Apr 24 13:50:57 EDT 2002


I agree with the others about a SQL backend; I'd suggest something like 
Postgres. It would be much more robust and probably faster, but that isn't 
the first reason I'd suggest it.

You have 4 files (80GB) now and some expectations of the combined data. 
What happens if they change those expectations? If you have established a 
SQL database, you can answer a new question with a new query instead of 
rewriting and rerunning a script over all 80GB.
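
As a rough sketch of what "load it once, ask new questions later" could look 
like: the snippet below streams fixed-width lines into a table in batches, so 
memory use stays flat however big the file is. It uses Python's bundled 
sqlite3 module purely as a stand-in for Postgres. The column layout (10-char 
ID, 8-char date, 5-char status), the table name and the sample records are all 
made up for illustration; your real layout will differ.

```python
import io
import sqlite3

# Hypothetical fixed-width layout: (column name, start offset, end offset).
LAYOUT = [("id", 0, 10), ("day", 10, 18), ("status", 18, 23)]

def parse_line(line):
    """Slice one fixed-width record into a tuple of stripped fields."""
    return tuple(line[start:end].strip() for _, start, end in LAYOUT)

def load(conn, lines, batch_size=10000):
    """Insert records in batches so the whole file never sits in memory."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS file1 (id TEXT, day TEXT, status TEXT)"
    )
    batch = []
    for line in lines:
        batch.append(parse_line(line.rstrip("\n")))
        if len(batch) >= batch_size:
            conn.executemany("INSERT INTO file1 VALUES (?, ?, ?)", batch)
            batch = []
    if batch:  # flush the final partial batch
        conn.executemany("INSERT INTO file1 VALUES (?, ?, ?)", batch)
    conn.commit()

# Stand-in for open("file1"); these three records are invented sample data.
sample = io.StringIO(
    "A00000000120020401LATE \n"
    "A00000000220020401OK   \n"
    "A00000000120020402LATE \n"
)

conn = sqlite3.connect(":memory:")
load(conn, sample)

# A "changed expectation" becomes a new query, not a new pass over the file.
rows = conn.execute(
    "SELECT id, day FROM file1 WHERE status = 'LATE' ORDER BY day"
).fetchall()
print(rows)
```

With a real file you would pass the open file object instead of the sample, 
and with Postgres you would probably use its bulk loader rather than INSERTs, 
but the shape of the job is the same.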

> I still haven't received the above-mentioned files, so I can't test the
> time needed to (for example) read a 15GB "file1", filter by a few
> variables, and write the resulting subset as "sub_file1".  Things
> would afterwards get more complicated because I will have to pull out
> IDs from "sub_file1", remove duplicate IDs to create
> "no_dup_sub_file1", match those to IDs in the remaining 3 main files, and
> pull out the data linked with those IDs.

SQL would let you remove duplicates from your results in one step, or show 
where those duplicates might be significant (e.g. how many times was Joe 
late?).
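
To make that concrete, here is a minimal sketch, again using Python's bundled 
sqlite3 module as a stand-in for a real server. The table name, columns and 
rows are invented for illustration. SELECT DISTINCT drops the duplicates, 
while GROUP BY with COUNT(*) tells you how often each one occurred.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical table: one row per late arrival (made-up data).
conn.execute("CREATE TABLE late (name TEXT, day TEXT)")
conn.executemany(
    "INSERT INTO late VALUES (?, ?)",
    [
        ("Joe", "2002-04-01"),
        ("Ann", "2002-04-01"),
        ("Joe", "2002-04-02"),
        ("Joe", "2002-04-03"),
    ],
)

# Remove duplicates in one step: each name appears once.
unique_names = [
    row[0]
    for row in conn.execute("SELECT DISTINCT name FROM late ORDER BY name")
]

# Or keep the duplicates' meaning: how many times was each person late?
late_counts = conn.execute(
    "SELECT name, COUNT(*) FROM late GROUP BY name ORDER BY name"
).fetchall()

print(unique_names)
print(late_counts)
```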

If this is a recurring requirement at work, I would consider hiring a 
contractor to build an elegant solution with defined steps that you can 
run yourself.







zaka07 at hotmail.com (Rad) wrote in 
news:ad381f5b.0204240841.52333a3e at posting.google.com:

> I am preparing myself to work on extracting data from 4 text files
> (fixed width format) whose combined size is about 80GB.  Considering
> deadlines, costs, and my limited programming knowledge I thought using
> Python/Windows for the job would be the best option for me.  However,
> I am worried about the speed at which Python (me and my hardware) will
> be able to deal with these massive data sets but I am hoping that this
> is still a quicker route than learning C.
> 
> I have a few weeks to prepare myself before the data arrives and my
> question is: am I going the right way about the project, is Python
> (with humanly written code) capable of doing this kind of stuff
> relatively quickly?
> 
> Any help and suggestions would be greatly appreciated.
> 
> Thanks
> 
> P.S. As you probably guessed, I'm new to Python



