[SciPy-user] handling of huge files for post-processing

David Huard david.huard at gmail.com
Tue Feb 26 12:23:01 EST 2008


Whether or not PyTables is going to make a difference really depends on how
much data you need at any given time to perform the computation. If this
exceeds your RAM, it doesn't matter what binary format you are using. That
being said, I am not familiar with sqlite, so I don't know whether there are
limitations on the database size.

Storing your data with PyTables will let you keep as many GB in a single
file as you wish. The tricky part will then be to extract only the data you
need for each computation and to make sure this always stays below the RAM
limit, or else swap memory will be used and things will slow down
considerably.
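
As an illustration, here is a minimal sketch of the pattern I have in mind,
using PyTables. The file name, array shape, chunk size and the
produce_chunks() helper are all made up, and the exact call names may differ
between PyTables versions (older releases spell them openFile/createEArray
rather than open_file/create_earray):

import numpy as np
import tables

def produce_chunks(n_chunks=100, rows=10000):
    # Hypothetical stand-in for whatever produces your simulation output.
    for _ in range(n_chunks):
        yield np.random.rand(rows, 8)

# Write: append the results chunk by chunk to an extendable array on disk.
h5 = tables.open_file('results.h5', mode='w')
arr = h5.create_earray(h5.root, 'data', tables.Float64Atom(), shape=(0, 8))
for chunk in produce_chunks():
    arr.append(chunk)
h5.close()

# Read back only the slice needed for the current computation;
# everything else stays on disk.
h5 = tables.open_file('results.h5', mode='r')
block = h5.root.data[100000:200000]
print(block.mean(axis=0))
h5.close()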

I suggest you try to estimate how much memory you'll be needing for your
computations, see how much RAM you have, and decide whether or not you
should just spend some euros and install additional RAM.
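
For what it's worth, the estimate is just a bit of arithmetic on the array
dimensions; the numbers below are made up:

import numpy as np

n_rows, n_cols = 50 * 10**6, 8                  # made-up dataset dimensions
bytes_per_value = np.dtype('float64').itemsize  # 8 bytes per double
gb = n_rows * n_cols * bytes_per_value / 1e9
print("%.1f GB needed to hold everything in RAM at once" % gb)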

Servus,

David

2008/2/26, Christoph Scheit <Christoph.Scheit at lstm.uni-erlangen.de>:
>
> Hello David,
>
> indeed, data in file a depends on data in file b...
> that's the biggest problem, and consequently
> I guess I need something that operates on the
> file system rather than in main memory.
>
> Do you think it's possible to use PyTables to
> tackle the problem? I would need something
> that can group together such enormous
> data sets. sqlite is nice for grouping the data
> of a table together, but I guess my data sets are
> just too big...
>
> Actually, unfortunately I don't see a way to
> iterate over the entries of the files in the
> manner you described below...
>
> Thanks,
>
> Christoph
> ------------------------------
>
> Message: 3
> Date: Tue, 26 Feb 2008 09:17:00 -0500
>
> From: "David Huard" <david.huard at gmail.com>
> Subject: Re: [SciPy-user] handling of huge files for post-processing
> To: "SciPy Users List" <scipy-user at scipy.org>
>
> Christoph,
>
> Do you mean that b depends on the entire dataset a? If so, you might
> consider buying additional memory; this is often far cheaper, in terms of
> your time, than trying to optimize the code.
>
> What I mean by iterators is that when you open a file, you generally have
> the possibility to iterate over each element in it. For instance, when
> reading an ascii file:
>
> f = open('somefile', 'r')
> for line in f:
>     pass    # some operation on the current line
>
> instead of loading the whole file into memory at once with:
>
> lines = f.readlines()
>
> This way, only one line is kept in memory at a time. If you can write your
> code in this manner, this might solve your memory problem. For instance,
> here is a generator that opens two files and returns the current line of
> each file each time its next() method is called:
>
> def read():
>     a = open('filea', 'r')
>     b = open('fileb', 'r')
>     la = a.readline()
>     lb = b.readline()
>     while la and lb:            # stop when either file runs out
>         yield la, lb
>         la = a.readline()
>         lb = b.readline()
>     a.close()
>     b.close()
>
> for la, lb in read():
>     pass    # some operation on la, lb
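>
> Roughly the same thing should be possible with the standard library:
> itertools.izip (the lazy version of zip in Python 2) pairs up the lines of
> the two files one pair at a time. A minimal sketch along those lines, with
> the same made-up file names as above:
>
> from itertools import izip
>
> a = open('filea', 'r')
> b = open('fileb', 'r')
> for la, lb in izip(a, b):
>     pass    # some operation on la, lb
> a.close()
> b.close()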
>
> HTH,
>
> David
>
>
> _______________________________________________
> SciPy-user mailing list
> SciPy-user at scipy.org
> http://projects.scipy.org/mailman/listinfo/scipy-user
>

