processing a Very Large file
Robert Brewer
fumanchu at amor.org
Tue May 17 15:32:30 EDT 2005
DJTB wrote:
> I'm trying to manually parse a dataset stored in a file. The
> data should be converted into Python objects.
>
> Here is an example of a single line of a (small) dataset:
>
> 3 13 17 19 -626177023 -1688330994 -834622062 -409108332 297174549
> 955187488 589884464 -1547848504 857311165 585616830 -749910209
> 194940864 -1102778558 -1282985276 -1220931512 792256075 -340699912
> 1496177106 1760327384 -1068195107 95705193 1286147818 -416474772
> 745439854 1932457456 -1266423822 -1150051085 1359928308 129778935
> 1235905400 532121853
>
> The first integer specifies the length of a tuple object. In this
> case, the tuple has three elements: (13, 17, 19).
> The other values (-626177023 to 532121853) are elements of a Set.
>
> I use the following code to process a file:
>
>
> from time import time
> from sets import Set
>
> file = 'pathtable_ht.dat'
> result = []
> start_time = time()
> f = open(file, 'r')
> for line in f:
>     splitres = line.split()
>     tuple_size = int(splitres[0]) + 1
>     path_tuple = tuple(splitres[1:tuple_size])
>     conflicts = Set(map(int, splitres[tuple_size:-1]))
>     # do something with 'path_tuple' and 'conflicts'
>     # ... do some processing ...
>     result.append((path_tuple, conflicts))
>
> f.close()
> print time() - start_time
>
>
> The elements (integer objects) in these Sets are being shared between
> the sets; in fact, there are as many distinct elements as there are
> lines in the file (e.g. 1000 lines -> 1000 distinct set elements).
> AFAIK, the elements are stored only once and each Set contains a
> pointer to the actual object.
>
> This works fine with relatively small datasets, but it doesn't work
> at all with large datasets (4500 lines, 45000 chars per line).
>
> After a few seconds of loading, all main memory is consumed by the
> Python process and the computer starts swapping. After a few more
> seconds, CPU usage drops from 99% to 1% and all swap memory is
> consumed:
>
> Mem:  386540k total, 380848k used,   4692k free,    796k buffers
> Swap: 562232k total, 562232k used,      0k free,  27416k cached
>
> At this point, my computer becomes unusable.
>
> I'd like to know if I should buy some more memory (a few GB?) or if
> it is possible to make my code more memory efficient.
The first question I would ask is: what are you doing with "result", and
can the consumption of "result" be done iteratively?
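For instance, if each record can be handled as soon as it is parsed, the loop can become a generator so that only one line's worth of objects is alive at a time. A sketch under that assumption, using the built-in set type and converting the path values to ints; note it reads to the end of each line, whereas the original's `splitres[tuple_size:-1]` slice silently dropped the last value:

```python
def parse_records(path):
    """Yield one (path_tuple, conflicts) pair per line instead of
    accumulating them all in a list."""
    with open(path) as f:
        for line in f:
            fields = line.split()
            if not fields:
                continue
            tuple_size = int(fields[0]) + 1
            path_tuple = tuple(int(x) for x in fields[1:tuple_size])
            # read through the end of the line
            conflicts = set(map(int, fields[tuple_size:]))
            yield path_tuple, conflicts

# Consume iteratively -- each record becomes garbage as soon as
# it has been processed:
# for path_tuple, conflicts in parse_records('pathtable_ht.dat'):
#     process(path_tuple, conflicts)
```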
Robert Brewer
System Architect
Amor Ministries
fumanchu at amor.org