python and very large data sets???

holger krekel pyth at devel.trillke.net
Wed Apr 24 13:24:03 EDT 2002


On Wed, Apr 24, 2002 at 12:58:29PM -0400, Aahz wrote:
> In article <ad381f5b.0204240841.52333a3e at posting.google.com>,
> Rad <zaka07 at hotmail.com> wrote:
> >
> >I still haven't received the above-mentioned files, so I can't test the
> >time needed to (for example) read a 15GB "file1", filter by a few
> >variables, and write the resulting subset as "sub_file1".  Things
> >would afterwards get more complicated because I will have to pull out
> >IDs from "sub_file1", remove duplicate IDs, create
> >"no_dup_sub_file1", match those to IDs in the remaining 3 main files,
> >and pull out data linked with those IDs.
> >
> 
> Python *can* handle this kind of task, but you'll be much better off if
> you interface Python to a database. 

I disagree. Don't be so humble :-)
Using a database requires

- setting up/configuring an appropriate database
  for the task (which may not be easy)

- getting the info from the files into the database
  (which requires reading the files anyway!)

- reading partial result sets back from the database
  and converting them to a file again (sketched below).
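
For concreteness, that route might look roughly like the sketch
below; sqlite3, the table/column layout and the filter column are
illustrative assumptions on my part, not something from the thread,
and note that loading still means reading and parsing the whole
15GB source file once anyway:

import csv
import sqlite3

# rough sketch only: load "file1" into a table, then pull a subset back out
conn = sqlite3.connect("bigdata.db")
conn.execute("CREATE TABLE IF NOT EXISTS file1 (id TEXT, country TEXT, value TEXT)")

with open("file1", newline="") as f:
    # still has to read/parse the entire source file to populate the table
    conn.executemany("INSERT INTO file1 VALUES (?, ?, ?)",
                     (row[:3] for row in csv.reader(f) if len(row) >= 3))
conn.commit()

# read partial result sets and convert them back into a file
with open("sub_file1", "w") as out:
    query = "SELECT id, country, value FROM file1 WHERE country = ?"
    for id_, country, value in conn.execute(query, ("UK",)):
        out.write(",".join((id_, country, value)) + "\n")
conn.close()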

I think this is error-prone and quite complex.
Why not a more Pythonic way, like this:

- use the mmap module to map the source and destination files
  into address space.

- write a function that runs over the mapped file, does some
  filtering/substitutions and writes to the destination file
  (a sketch follows below). Python should be fast enough here, I guess.
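
A minimal sketch of that idea, assuming newline-delimited records;
keep_uk() and the column test are made-up examples, and I write the
output with ordinary file writes instead of a second mapping, since
the result size isn't known up front:

import mmap

# sketch only: filter a big file via a read-only memory mapping
# note: a 15GB mapping needs a 64-bit address space; otherwise map the
# file in slices or use the chunked fallback further down
def filter_file(src_path, dst_path, keep_line):
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        mm = mmap.mmap(src.fileno(), 0, access=mmap.ACCESS_READ)
        try:
            for line in iter(mm.readline, b""):   # walk the mapping line by line
                if keep_line(line):
                    dst.write(line)
        finally:
            mm.close()

# made-up filter: keep records whose third comma-separated field is "UK"
def keep_uk(line):
    fields = line.strip().split(b",")
    return len(fields) > 2 and fields[2] == b"UK"

filter_file("file1", "sub_file1", keep_uk)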

If mmap does not work well enough (on Windows), you can
resort to reading the file in 100MB chunks with file.read()
and file.seek() or other methods.
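
A sketch of that fallback, reusing keep_uk() from above; the 100MB
figure follows the suggestion, the rest is illustrative. Sequential
read() calls are enough here, so seek() is only needed if you want
to skip around in the file:

CHUNK = 100 * 1024 * 1024   # 100MB, as suggested above

def filter_file_chunked(src_path, dst_path, keep_line):
    leftover = b""
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        while True:
            chunk = src.read(CHUNK)
            if not chunk:
                break
            lines = (leftover + chunk).split(b"\n")
            leftover = lines.pop()        # last piece may be an incomplete line
            for line in lines:
                if keep_line(line + b"\n"):
                    dst.write(line + b"\n")
        if leftover and keep_line(leftover):
            dst.write(leftover)

filter_file_chunked("file1", "sub_file1", keep_uk)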

It is important to know what your time constraints are:
10 minutes, an hour, a night?

	holger




