[SciPy-user] handling of huge files for post-processing
Christoph Scheit
Christoph.Scheit at lstm.uni-erlangen.de
Tue Feb 26 10:27:42 EST 2008
Hello David,
indeed, the data in file a depends on the data in file b...
that's the biggest problem, and consequently
I guess I need something that operates better
on the file system than in main memory.
Do you think it's possible to use PyTables to
tackle the problem? I would need something
that can group together such enormous
data sets. sqlite is nice for grouping the data of
a table together, but I guess my data sets are
just too big...
Actually, I unfortunately don't see a way
to iterate over the entries of the files in the
manner you described below.
Thanks,
Christoph
------------------------------
Message: 3
Date: Tue, 26 Feb 2008 09:17:00 -0500
From: "David Huard" <david.huard at gmail.com>
Subject: Re: [SciPy-user] handling of huge files for post-processing
To: "SciPy Users List" <scipy-user at scipy.org>
Message-ID:
<91cf711d0802260617o4d768824wbf5fae702b59f00a at mail.gmail.com>
Content-Type: text/plain; charset="iso-8859-1"
Christoph,
Do you mean that b depends on the entire dataset a ? In this case, you might
consider buying additional memory; this is often way cheaper in terms of
time than trying to optimize the code.
What I mean by iterators is that when you open a binary file, you generally
have the possibility to iterate over each element in the file. For instance,
when reading an ascii file:
    for line in f:
        # some operation on the current line
        pass
instead of loading the whole file into memory:
    lines = f.readlines()
This way, only one line is kept in memory at a time. If you can write your
code in this manner, this might solve your memory problem. For instance,
here is a generator that opens two files and returns the current line of
each file each time next() is called on it:
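    def read():
        # Open both files and step through them in lockstep.
        with open('filea', 'r') as a, open('fileb', 'r') as b:
            la = a.readline()
            lb = b.readline()
            # Stop as soon as either file is exhausted.
            while la and lb:
                yield la, lb
                la = a.readline()
                lb = b.readline()

    for a, b in read():
        # some operation on a, b
        pass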
HTH,
David
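
The same lockstep idea carries over to binary files, which is closer to
the case discussed in this thread. What follows is only a sketch under
stated assumptions: the record layout (one little-endian float64 per
record), the names read_records and read_pairs, and the chunk size are
all hypothetical and not taken from the original messages. It reads each
file in bounded chunks, so memory use stays roughly constant regardless
of file size.

```python
import struct

# Hypothetical record layout: one little-endian float64 per record.
RECORD = struct.Struct("<d")


def read_records(path, records_per_chunk=4096):
    """Yield one decoded value at a time, reading the file in bounded chunks."""
    chunk_bytes = RECORD.size * records_per_chunk
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_bytes)
            if not chunk:
                break
            # Ignore a trailing partial record, if there is one.
            usable = len(chunk) - len(chunk) % RECORD.size
            for (value,) in RECORD.iter_unpack(chunk[:usable]):
                yield value


def read_pairs(path_a, path_b):
    """Yield (a, b) pairs in lockstep; stops when the shorter file ends."""
    return zip(read_records(path_a), read_records(path_b))
```

Usage would look like "for a, b in read_pairs('filea', 'fileb'): ...".
This only helps when each entry of one file depends on the matching
entry (or a bounded window) of the other; if an entry depends on the
entire other data set, streaming alone does not remove the memory
problem.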