python reading file memory cost

Chris Rebert clp2 at rebertia.com
Tue Aug 2 02:55:15 EDT 2011


On Mon, Aug 1, 2011 at 8:22 PM, Tony Zhang <warriorlance at gmail.com> wrote:
> Thanks!
>
> Actually, I used .readline() to parse the file line by line, because I
> need to find the start position to begin extracting data into a list,
> and the end point to stop extracting, then repeat until the end of the file.
> My file to read is formatted like this:
>
> blabla...useless....
> useless...
>
> /sign/
> data block(e.g. 10 cols x 1000 rows)
> ...
> blank line
> /sign/
> data block(e.g. 10 cols x 1000 rows)
> ...
> blank line
> ...
> ...
> EOF
> let's call this file 'myfile'
> and my Python snippet:
>
> f = open('myfile', 'r')
> blocknum = 0  # number the data block
> data = []
> while True:
>        # find the beginning of the next data block
>        while not f.readline().startswith('/a1/'): pass
>        # create a new sub-list to store this data block
>        data.append([])
>        blocknum += 1
>        line = f.readline()
>
>        while line.strip():
>        # a blank line marks the end of one block
>                data[blocknum-1].append(["%2.6E" % float(x) for x in line.split()])
>                line = f.readline()
>        print "Read Block %d" % blocknum
>        if not f.readline(): break
>
> The result was that reading a 500 MB file consumed almost 2 GB of RAM.
> I cannot figure out why -- somebody help!

If you stored the floats themselves, rather than their string
representations, that would be more space-efficient. You could then
also use the `array` module (http://docs.python.org/library/array.html),
which stores numbers much more compactly than a list of Python objects.
NumPy would also be worth investigating, since multidimensional arrays
are involved.
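As a rough sketch of what that might look like for your format (assuming
the '/a1/' marker and whitespace-separated columns from your snippet;
each block's rows are flattened into one array of doubles):

import array

f = open('myfile', 'r')
blocks = []                        # one array.array of C doubles per data block
line = f.readline()
while line:
    if line.startswith('/a1/'):
        block = array.array('d')   # ~8 bytes per value, no per-item str objects
        line = f.readline()
        while line.strip():        # a blank line (or EOF) ends the block
            block.extend(float(x) for x in line.split())
            line = f.readline()
        blocks.append(block)
    line = f.readline()
f.close()

Each value then costs roughly the size of a C double instead of a
separate Python string object, and with NumPy you could reshape such a
flat buffer back into a 10-column 2-D array if the row structure matters.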

The next obvious question would then be: do you /really/ need /all/ of
the data in memory at once?
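If not, a generator that yields one block at a time would cap peak
memory at roughly a single block. A minimal sketch, again assuming the
'/a1/' marker (iter_blocks is just a made-up name):

def iter_blocks(path, marker='/a1/'):
    # Walk the file once, yielding each data block as a list of rows of floats.
    f = open(path, 'r')
    try:
        line = f.readline()
        while line:
            if line.startswith(marker):
                block = []
                line = f.readline()
                while line.strip():          # a blank line ends the block
                    block.append([float(x) for x in line.split()])
                    line = f.readline()
                yield block
            line = f.readline()
    finally:
        f.close()

for blocknum, block in enumerate(iter_blocks('myfile'), 1):
    print "Read Block %d (%d rows)" % (blocknum, len(block))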

Also, just so you're aware:
http://docs.python.org/library/sys.html#sys.getsizeof
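
For instance, a quick comparison of the two per-value representations
(a minimal sketch; exact byte counts vary by platform and Python
version, and getsizeof does not include the overhead of the containing
lists):

import sys

value = 0.123456
as_string = "%2.6E" % value
print "float object:  %d bytes" % sys.getsizeof(value)
print "formatted str: %d bytes" % sys.getsizeof(as_string)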

Cheers,
Chris
--
http://rebertia.com


