python and very large data sets???

holger krekel pyth at devel.trillke.net
Wed Apr 24 17:04:54 EDT 2002


On Wed, Apr 24, 2002 at 01:38:29PM -0400, Neal Norwitz wrote:

> To give you an idea of Python's speed, I was able to write out 600MB
> of data in 450 seconds.  I could read the data, modify it, and write
> it back out in 700 seconds.

i am not sure if your code would run much faster if coded the same
way in c...

> 
> My box is a 650 MHz Athlon, 256 MB RAM
> 
> time to write  input file: 450.9 seconds
> time to modify input file: 702.2 seconds
> first line of output file: A B C
> file sizes (input/output): 600000000/600000000

well, this really depends on how you do it (even in c).
my code example below is about 6 times faster on
a 750 MHz Athlon, 256 MB RAM with an old harddisk (!):

time to write input file[600000 kb]:                  69.1 seconds
time to read input file, upper(), write to new file: 129.8 seconds
file sizes (input/output): 600000kb /600000kb
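
As a rough check of the "about 6 times faster" figure, the ratios can be
computed from the timings quoted above. The variable names are made up for
this sketch, and the comparison is only approximate: the two machines differ,
and the file sizes are not identical (600000000 bytes versus 600000 kb, which
is 614400000 bytes).

```python
# Timings quoted in this thread, in seconds.
neal_write, neal_modify = 450.9, 702.2
holger_write, holger_modify = 69.1, 129.8

# Plain wall-clock ratios; hardware differences are ignored here.
write_speedup = neal_write / holger_write
modify_speedup = neal_modify / holger_modify

print("write:  %.1fx faster" % write_speedup)
print("modify: %.1fx faster" % modify_speedup)
```

This gives roughly 6.5x for writing and 5.4x for the read/modify/write pass.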

This seems to be much more IO-bound than python-bound.
I am pretty sure you can improve the modification
time by using two separate modern disks, mmap, threads and what not.
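
A minimal sketch of the mmap idea, assuming a current Python 3 rather than
the python2.2 of the script at the end of this mail. The name modify_mmap and
the 64 kb chunk size are assumptions for illustration, not something measured
in this thread, and there is no claim here that it is actually faster on a
given disk.

```python
import mmap
import os

CHUNK = 64 * 1024  # assumed chunk size, not tuned


def modify_mmap(src, dst, chunk=CHUNK):
    """Write an uppercased copy of src to dst, reading src through mmap."""
    if os.path.getsize(src) == 0:
        # An empty file cannot be memory-mapped; just create an empty output.
        open(dst, 'wb').close()
        return
    with open(src, 'rb') as f1, open(dst, 'wb') as f2:
        # Map the whole input read-only; the OS pages it in on demand.
        m = mmap.mmap(f1.fileno(), 0, access=mmap.ACCESS_READ)
        try:
            for off in range(0, len(m), chunk):
                # Slicing the map yields a bytes object for this chunk.
                f2.write(m[off:off + chunk].upper())
        finally:
            m.close()
```

On a 32-bit machine, mapping a 600 MB file in one piece may not fit into the
address space, so the mapping itself might have to be done in windows.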

Anyway, Bengt Richter correctly pointed out that one needs
a better specification to come up with a reasonable
estimate.

I just don't happen to see the advantages of bringing
a database into the picture. It seems like a classical
batch job, and 'random access many times' is not needed,
so why?

	holger


-- snip 

#!/usr/bin/env python2.2

import os,sys,time

kb=1024

fn1='/net/projects/bigtestfile'
fn2='/net/projects/bigtestfile.out'

def filesize(fn):
    return os.stat(fn)[6]   # index 6 of the stat tuple is st_size (bytes)

def filekbsize(fn):
    return filesize(fn)/kb
    
def create(kbnum):
    f=open(fn1, 'wb+')
    print "writing file with size %d kb" % kbnum
    for i in xrange(kbnum):
        f.write('a'*kb)
    f.close()

def modify():
    # read the input in 1 kb chunks, uppercase each chunk
    # and write it to the output file
    f1=open(fn1,'rb')
    f2=open(fn2,'wb+')

    while 1:
        kbbuf = f1.read(kb)
        if len(kbbuf)==0:
            break

        kbbuf = kbbuf.upper()
        f2.write(kbbuf)

    f2.close()
    f1.close()

if __name__=='__main__':
    size = len(sys.argv)>1 and int(sys.argv[1]) or 1000

    start = time.time()
    create(size)
    print 'time to write input file[%d kb]: %.1f seconds' % (filekbsize(fn1),time.time() - start)
    
    start = time.time()
    modify()
    print 'time to read input file, upper(), write to new file: %.1f seconds' % (time.time() - start)

    print 'file sizes (input/output): %dkb /%dkb' % \
                    (filekbsize(fn1),filekbsize(fn2))


-- snip