save tuple of simple data types to disk (low memory footprint)

Sat Oct 29 13:47:42 EDT 2011

On 10/29/11 11:44, Gelonida N wrote:
> I would like to save many dicts with a fixed (and known) amount of keys
> in a memory efficient manner (no random, but only sequential access is
> required) to a file (which can later be sent over a slow expensive
> network to other machines)
>
> Example:
> Every dict will have the keys 'timestamp', 'floatvalue', 'intvalue',
> 'message1', 'message2'
> 'timestamp' is an integer
> 'floatvalue' is a float
> 'intvalue' an int
> 'message1' is a string with a length of max 2000 characters, but can
> often be very short
> 'message2' the same as message1
>
> so a typical dict will look like
> { 'timestamp' : 12, 'floatvalue': 3.14159, 'intvalue': 42,
>   'message1' : '', 'message2' : '=' * 1999 }
>
>
>>
>> What do you call "many"? Fifty? A thousand? A thousand million? How many
>> items in each dict? Ten? A million?
>
> File size can be between 100kb and over 100Mb per file. Files will be
> accumulated over months.

If Steven's pickle-protocol2 solution doesn't quite do what you 
need, you can do something like the code below.  Gzip is pretty 
good at addressing...

>> Or have you considered simply compressing the files?
> Compression makes sense but the initial file format should be
> already rather 'compact'

...by compressing out a lot of the duplicated content, which also 
mitigates some of the verbosity of CSV.
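
As a rough illustration of that point (a sketch, not a benchmark), the
snippet below renders repeated rows like the example above as CSV in
memory and compares the size before and after gzip.  The helper name
csv_bytes and the sample row are made up for illustration, and it
assumes Python 3.2+ for gzip.compress.  Real data with varied messages
will compress less well than this deliberately redundant sample.

```python
import csv
import gzip
import io

def csv_bytes(rows):
    # hypothetical helper: render rows as CSV and return the encoded bytes
    buf = io.StringIO()
    csv.writer(buf).writerows(rows)
    return buf.getvalue().encode('utf-8')

# made-up sample: the same row repeated, i.e. highly redundant
rows = [(12, 3.14159, 42, 'hello world', '=' * 1999)] * 1000

raw = csv_bytes(rows)
packed = gzip.compress(raw)

# uncompressed size vs. gzipped size, in bytes
print(len(raw), len(packed))
```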

The code below serializes the data to a gzipped CSV file, then 
deserializes it.  Just point it at the appropriate data-source and 
adjust the column-names and data-types.

-tkc

import gzip
from csv import writer, reader

# (converter, column-name) pairs, in file order;
# adjust for your column-names/data-types
COLUMNS = (
    (int, 'timestamp'),
    (float, 'floatvalue'),
    (int, 'intvalue'),
    (str, 'message1'),
    (str, 'message2'),
    )

data = [ # use your real data here
    {
    'timestamp': 12,
    'floatvalue': 3.14159,
    'intvalue': 42,
    'message1': 'hello world',
    'message2': '=' * 1999,
    },
    ] * 10000

# on Python 3 the csv module needs a text-mode file,
# hence gzip.open(..., 'wt', newline='')
with gzip.open('data.gz', 'wt', newline='') as f:
    w = writer(f)
    for row in data:
        w.writerow([row[name] for _, name in COLUMNS])

# read it back, converting each column to its data-type
output = []
with gzip.open('data.gz', 'rt', newline='') as f:
    for row in reader(f):
        d = dict(
            (name, convert(value))
            for (convert, name), value in zip(COLUMNS, row)
            )
        output.append(d)

# or

with gzip.open('data.gz', 'rt', newline='') as f:
    output = [
        dict(
            (name, convert(value))
            for (convert, name), value in zip(COLUMNS, row)
            )
        for row in reader(f)
        ]
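
Since only sequential access is required, the rows don't all have to
sit in a list at once.  Here is a minimal sketch of a generator that
yields one dict at a time, assuming Python 3 and the same column layout
as above.  The names write_rows, read_rows and COLUMNS are made up for
this sketch, and 'stream_demo.gz' is just a placeholder path.

```python
import csv
import gzip

# (converter, column-name) pairs in file order; adjust to your data
COLUMNS = (
    (int, 'timestamp'),
    (float, 'floatvalue'),
    (int, 'intvalue'),
    (str, 'message1'),
    (str, 'message2'),
)

def write_rows(path, dicts):
    # write an iterable of dicts as gzipped CSV, one row at a time
    with gzip.open(path, 'wt', newline='') as f:
        w = csv.writer(f)
        for d in dicts:
            w.writerow([d[name] for _, name in COLUMNS])

def read_rows(path):
    # yield one dict per CSV row, so only the current row is held in memory
    with gzip.open(path, 'rt', newline='') as f:
        for row in csv.reader(f):
            yield {name: conv(value)
                   for (conv, name), value in zip(COLUMNS, row)}

# made-up sample record, shaped like the example in the question
sample = {'timestamp': 12, 'floatvalue': 3.14159, 'intvalue': 42,
          'message1': '', 'message2': '=' * 1999}

# the writer also accepts a generator, so nothing is materialized up front
write_rows('stream_demo.gz', (sample for _ in range(100)))

# consume the file sequentially without building a list
total = sum(d['intvalue'] for d in read_rows('stream_demo.gz'))
print(total)
```

Because read_rows is a generator, the consumer decides whether to keep
the rows (list(read_rows(...))) or just process them as they stream by.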