Processing a Very Large file

Tim Peters tim.peters at gmail.com
Tue May 17 12:39:58 EDT 2005


[DJTB]
> I'm trying to manually parse a dataset stored in a file. The data should be
> converted into Python objects.
>
> Here is an example of a single line of a (small) dataset:
> 
> 3 13 17 19 -626177023 -1688330994 -834622062 -409108332 297174549 955187488 589884464 -1547848504 857311165 585616830 -749910209 194940864 -1102778558 -1282985276 -1220931512 792256075 -340699912 1496177106 1760327384 -1068195107 95705193 1286147818 -416474772 745439854 1932457456 -1266423822 -1150051085 1359928308 129778935 1235905400 532121853
> 
> The first integer specifies the length of a tuple object. In this case, the
> tuple has three elements: (13, 17, 19)
> The other values (-626177023 to 532121853) are elements of a Set.
>
> I use the following code to process a file:
> 
> from time import time
> from sets import Set
> from string import split

Note that you don't use string.split later.

> file = 'pathtable_ht.dat'
> result = []
> start_time = time ()
> f=open(file,'r')
> for line in f:
>        splitres = line.split()

Since they're all integers, may as well:

        splitres = map(int, line.split())

here and skip repeated int() calls later.

>        tuple_size = int(splitres[0])+1
>        path_tuple = tuple(splitres[1:tuple_size])
>        conflicts = Set(map(int,splitres[tuple_size:-1]))

Do you really mean to throw away the last value on the line?  That is,
why is the slice here [tuple_size:-1] rather than [tuple_size:]?
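
If the last value really does belong in the set, the per-line parsing
(building on the map(int, ...) suggestion above -- just a sketch) would be:

    splitres = map(int, line.split())
    tuple_size = splitres[0] + 1
    path_tuple = tuple(splitres[1:tuple_size])
    conflicts = Set(splitres[tuple_size:])  # no -1, and no repeated int() calls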

>        # do something with 'path_tuple' and 'conflicts'
>        # ... do some processing ...
>        result.append(( path_tuple, conflicts))
>
> f.close()
> print time() - start_time
> 
> The elements (integer objects) in these Sets are being shared between the
> sets; in fact, there are as many distinct elements as there are lines in the
> file (e.g. 1000 lines -> 1000 distinct set elements). AFAIK, the elements are
> stored only once and each Set contains a pointer to the actual object.

Only "small" integers are stored uniquely; e.g., these aren't:

>>> 100 * 100 is 100 * 100
False
>>> int("12345") is int("12345")
False

You could manually do something akin to Python's "string interning" to
store ints uniquely, like:

    int_table = {}
    def uniqueint(i):
        return int_table.setdefault(i, i)

Then, e.g.,

>>> uniqueint(100 * 100) is uniqueint(100 * 100) 
True
>>> uniqueint(int("12345")) is uniqueint(int("12345"))
True

Doing Set(map(uniqueint, etc)) would then feed truly shared int
(and/or long) objects to the Set constructor.
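
Combined with the parsing sketch above, the whole loop body could become
something like this (untested, keeping your names, with uniqueint() as
defined above):

    for line in f:
        splitres = map(uniqueint, map(int, line.split()))
        tuple_size = splitres[0] + 1
        path_tuple = tuple(splitres[1:tuple_size])
        conflicts = Set(splitres[tuple_size:])
        result.append((path_tuple, conflicts))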

> This works fine with relatively small datasets, but it doesn't work at all
> with large datasets (4500 lines, 45000 chars per line).

Well, chars/line doesn't mean anything to us.  Knowing # of set
elements/line might help.  Say there are 4500 per line.  Then you've
got about 20 million integers.  That will consume at least several hundred
MB if you don't work to share duplicates.  But if you do, it
should cut the memory burden by a factor of thousands.
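
Rough arithmetic, just to show where "several hundred MB" comes from
(assuming ~12 bytes per int object on a 32-bit CPython, and ignoring each
Set's own hash-table overhead -- these are ballpark figures, not
measurements):

>>> 4500 * 4500                 # total int objects
20250000
>>> 4500 * 4500 * 12 / 2**20    # MB for the int objects alone
231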

> After a few seconds of loading, all main memory is consumed by the Python
> process and the computer starts swapping. After a few more seconds, CPU
> usage drops from 99% to 1% and all swap memory is consumed:
>
> Mem:    386540k total,   380848k used,     4692k free,      796k buffers
> Swap:   562232k total,   562232k used,        0k free,    27416k cached
>
> At this point, my computer becomes unusable.
>
> I'd like to know if I should buy some more memory (a few GB?) or if it is
> possible to make my code more memory efficient.

See above for the latter.  If you have a 32-bit processor, you won't
be able to _address_ more than a few GB anyway.  Still, 384MB of RAM
is on the light side these days <wink>.


