processing a Very Large file
Robert Brewer
fumanchu at amor.org
Tue May 17 15:32:30 EDT 2005
DJTB wrote:
> I'm trying to manually parse a dataset stored in a file. The
> data should be converted into Python objects.
>
> Here is an example of a single line of a (small) dataset:
>
> 3 13 17 19 -626177023 -1688330994 -834622062 -409108332 297174549
> 955187488 589884464 -1547848504 857311165 585616830 -749910209
> 194940864 -1102778558 -1282985276 -1220931512 792256075 -340699912
> 1496177106 1760327384 -1068195107 95705193 1286147818 -416474772
> 745439854 1932457456 -1266423822 -1150051085 1359928308 129778935
> 1235905400 532121853
>
> The first integer specifies the length of a tuple object. In this
> case, the tuple has three elements: (13, 17, 19).
> The other values (-626177023 to 532121853) are elements of a Set.
>
> I use the following code to process a file:
>
>
> from time import time
> from sets import Set
>
> file = 'pathtable_ht.dat'
> result = []
> start_time = time()
> f = open(file, 'r')
> for line in f:
>     splitres = line.split()
>     tuple_size = int(splitres[0]) + 1
>     path_tuple = tuple(splitres[1:tuple_size])
>     conflicts = Set(map(int, splitres[tuple_size:-1]))
>     # do something with 'path_tuple' and 'conflicts'
>     # ... do some processing ...
>     result.append((path_tuple, conflicts))
>
> f.close()
> print time() - start_time
>
>
> The elements (integer objects) in these Sets are being shared between
> the sets; in fact, there are as many distinct elements as there are
> lines in the file (e.g. 1000 lines -> 1000 distinct set elements).
> AFAIK, the elements are stored only once and each Set contains a
> pointer to the actual object.
>
> This works fine with relatively small datasets, but it doesn't work
> at all with large datasets (4500 lines, 45000 chars per line).
>
> After a few seconds of loading, all main memory is consumed by the
> Python process and the computer starts swapping. After a few more
> seconds, CPU usage drops from 99% to 1% and all swap memory is
> consumed:
>
> Mem:  386540k total, 380848k used,   4692k free,    796k buffers
> Swap: 562232k total, 562232k used,      0k free,  27416k cached
>
> At this point, my computer becomes unusable.
>
> I'd like to know if I should buy some more memory (a few GB?) or if
> it is possible to make my code more memory efficient.
The first question I would ask is: what are you doing with "result", and
can the consumption of "result" be done iteratively?
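For instance, if each record can be handled as soon as it is parsed, the loop can become a generator so that only one line's worth of objects is alive at a time. A sketch under that assumption, using the built-in set type and converting the path values to ints; note it reads to the end of each line, whereas the original's `splitres[tuple_size:-1]` slice silently dropped the last value:

```python
def parse_records(path):
    """Yield one (path_tuple, conflicts) pair per line instead of
    accumulating them all in a list."""
    with open(path) as f:
        for line in f:
            fields = line.split()
            if not fields:
                continue
            tuple_size = int(fields[0]) + 1
            path_tuple = tuple(int(x) for x in fields[1:tuple_size])
            # read through the end of the line
            conflicts = set(map(int, fields[tuple_size:]))
            yield path_tuple, conflicts

# Consume iteratively -- each record becomes garbage as soon as
# it has been processed:
# for path_tuple, conflicts in parse_records('pathtable_ht.dat'):
#     process(path_tuple, conflicts)
```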
Robert Brewer
System Architect
Amor Ministries
fumanchu at amor.org