[SciPy-user] handling of huge files for post-processing

David Huard david.huard at gmail.com
Tue Feb 26 09:17:00 EST 2008


Christoph,

Do you mean that b depends on the entire dataset a? In that case, you might
consider buying additional memory; this is often far cheaper in terms of
time than trying to optimize the code.

What I mean by iterators is that when you open a file, you can generally
iterate over its elements one at a time. For instance,
when reading an ascii file:

for line in f:
    pass  # some operation on the current line

instead of loading the whole file into memory:
lines = f.readlines()

This way, only one line is kept in memory at a time. If you can write your
code in this manner, this might solve your memory problem. For instance,
here is a generator that opens two files and returns the current line of
each file each time its next() method is called:
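def read():
    # Open both files; the with block closes them when the generator finishes.
    with open('filea', 'r') as a, open('fileb', 'r') as b:
        la = a.readline()
        lb = b.readline()
        # Stop as soon as either file is exhausted.
        while la and lb:
            yield la, lb
            la = a.readline()
            lb = b.readline()

for a, b in read():
    pass  # some operation on a, b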
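
The same idea carries over to binary files with fixed-size records. What
follows is only a rough sketch under assumptions I cannot check from here:
that each record is two 4-byte integers followed by one 4-byte float, stored
little-endian with no Fortran record markers in between, and that there is
one file per CPU for a given point. The names RECORD, read_records and
sum_over_cpus are made up for the illustration.

```python
import struct
from collections import defaultdict

# Assumed layout: TimeStep (int32), DebugInfo (int32), AcousticPressure
# (float32), little-endian, no Fortran record markers. Adjust as needed.
RECORD = struct.Struct("<iif")


def read_records(path):
    """Yield (timestep, debug, pressure) tuples one record at a time."""
    with open(path, "rb") as f:
        while True:
            chunk = f.read(RECORD.size)
            if len(chunk) < RECORD.size:
                break  # end of file (or a truncated trailing record)
            yield RECORD.unpack(chunk)


def sum_over_cpus(paths):
    """Sum the pressure per time step over the per-CPU files."""
    totals = defaultdict(float)
    for path in paths:
        for timestep, _debug, pressure in read_records(path):
            totals[timestep] += pressure
    return totals
```

Only one record per file is held in memory at a time; what grows is the
dictionary of running sums, which has one entry per time step rather than
one per record.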

HTH,

David





2008/2/26, Christoph Scheit <Christoph.Scheit at lstm.uni-erlangen.de>:
>
> Hello David,
>
> I guess that everything is kept in memory... but I don't
> know how to handle this problem using iterators. Can
> you give me some more detail? Do you read your files
> all at once?
>
> One problem is that, assuming I have three files
> a, b and c, then
> b depends on data from a
> c depends on data from b (and maybe from a, but
> in 99% of cases this will not be so)
> This is due to differences in signal runtime...
>
> christoph
>
> ------------------------------
>
> Message: 4
> Date: Mon, 25 Feb 2008 09:53:31 -0500
> From: "David Huard" <david.huard at gmail.com>
> Subject: Re: [SciPy-user] handling of huge files for post-processing
> To: "SciPy Users List" <scipy-user at scipy.org>
> Message-ID:
>         <91cf711d0802250653g652df1f9mdd9aaa5adf869bc5 at mail.gmail.com>
> Content-Type: text/plain; charset="iso-8859-1"
>
>
> Hi Christoph,
>
> I am not sure exactly what causes your method to fail but it might be that
> you are trying to hold all the arrays in memory at once. Can you do your
> calculation using iterators/generators? The idea is to load into memory
> only the part of the array that you need for a given calculation, store
> the result and continue iterating. I used to process ~2GB files using
> iterators from PyTables tables and it worked smoothly.
>
> David
>
>
> 2008/2/25, Christoph Scheit <Christoph.Scheit at lstm.uni-erlangen.de>:
> >
> > Hello everybody,
> >
> > A Fortran code (CFD) gives me binary files containing
> > the acoustic pressure at some distinct points.
> > Each file has N "lines" which look like this:
> >
> > TimeStep(int) DebugInfo (int) AcousticPressure(float)
> >
> > and is binary. My problem is that a file can be
> > huge (> 100 MB) and that after several runs on a cluster
> > not just one but 20 - 50 files of that size have
> > to be post-processed.
> >
> > Since the CFD code runs in parallel, I have to sum up
> > the results from different CPUs (CPU 1 calculates only
> > a fraction of the acoustic pressure of point p and time step
> > t, so that I have to sum over all CPUs)
> >
> > Currently I'm reading all the data into a sqlite table, then
> > I group the data, summing up over the processors, and
> > then I write out files containing the data of the single
> > points. This approach more or less works for smaller files,
> > but does not seem to work for big files like those described
> > above.
> >
> > Do you have some ideas on this problem? Thank you very
> > much in advance,
> >
> > Christoph
> > _______________________________________________
> > SciPy-user mailing list
> > SciPy-user at scipy.org
> > http://projects.scipy.org/mailman/listinfo/scipy-user
> >
>