python and very large data sets???
holger krekel
pyth at devel.trillke.net
Wed Apr 24 17:04:54 EDT 2002
On Wed, Apr 24, 2002 at 01:38:29PM -0400, Neal Norwitz wrote:
> To give you an idea of Python's speed, I was able to write out 600MB
> of data in 450 seconds. I could read the data, modify it, and write
> it back out in 700 seconds.
i am not sure if your code would run much faster if coded the same
way in c...
>
> My box is a 650 MHz Athlon, 256 MB RAM
>
> time to write input file: 450.9 seconds
> time to modify input file: 702.2 seconds
> first line of output file: A B C
> file sizes (input/output): 600000000/600000000
well, this really depends on how you do it (even in c).
my code example below is about 6 times faster on
a 750 MHz Athlon, 256 MB RAM with an old harddisk (!):
time to write input file[600000 kb]: 69.1 seconds
time to read input file, upper(), write to new file: 129.8 seconds
file sizes (input/output): 600000kb /600000kb
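As a rough cross-check of the two sets of timings quoted above, the throughput and speedup can be worked out directly. This is only a sketch in modern Python 3 (not the Python 2.2 of the script); the helper name `mb_per_second` is made up for illustration, and it assumes the modify phase moves every byte twice (one read, one write).

```python
def mb_per_second(num_bytes, seconds):
    """Throughput in decimal megabytes (10**6 bytes) per second."""
    return num_bytes / seconds / 1e6


NEAL_BYTES = 600000000        # file size quoted by Neal
HOLGER_BYTES = 600000 * 1024  # 600000 kb with kb = 1024

# Write phase: the file is written once.
neal_write = mb_per_second(NEAL_BYTES, 450.9)
holger_write = mb_per_second(HOLGER_BYTES, 69.1)

# Modify phase: every byte is read once and written once.
neal_modify = mb_per_second(2 * NEAL_BYTES, 702.2)
holger_modify = mb_per_second(2 * HOLGER_BYTES, 129.8)

# Ratio of the wall-clock times reported in the two posts.
write_speedup = 450.9 / 69.1
modify_speedup = 702.2 / 129.8

print("write:  %.1f vs %.1f MB/s (%.1fx)" % (neal_write, holger_write, write_speedup))
print("modify: %.1f vs %.1f MB/s (%.1fx)" % (neal_modify, holger_modify, modify_speedup))
```

That works out to roughly 1.3 vs 8.9 MB/s for writing and 1.7 vs 9.5 MB/s for modifying, i.e. speedups of about 6.5x and 5.4x. The two machines and disks differ, so the comparison is only approximate.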
This seems to be much more IO-bound than python-bound.
I am pretty sure you can improve the modification
time by using two separate modern disks, mmap, threads and what not.
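Of the options just listed, mmap is the easiest to sketch. The following is a minimal illustration in modern Python 3, not the Python 2.2 of the script and not something that was measured for this post. The function name `upper_copy_mmap`, the 1 MiB slice size and the temporary file names are all assumptions. Whether it actually beats plain `read()` calls depends on the OS and the disks.

```python
import mmap
import os
import tempfile

CHUNK = 1024 * 1024  # assumed slice size (1 MiB); tune for the machine at hand


def upper_copy_mmap(src_path, dst_path, chunk=CHUNK):
    """Write an upper-cased copy of src_path to dst_path; return bytes copied."""
    size = os.path.getsize(src_path)
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        if size == 0:
            return 0  # an empty file cannot be memory-mapped
        # Map the whole input read-only; the OS pages it in on demand.
        with mmap.mmap(src.fileno(), 0, access=mmap.ACCESS_READ) as view:
            for offset in range(0, size, chunk):
                # Slicing the map yields bytes, which are upper-cased and written.
                dst.write(view[offset:offset + chunk].upper())
    return size


if __name__ == "__main__":
    # Small self-check on a throwaway file (hypothetical names).
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "bigtestfile")
        dst = os.path.join(tmp, "bigtestfile.out")
        with open(src, "wb") as f:
            f.write(b"a" * (3 * 1024 + 5))
        copied = upper_copy_mmap(src, dst, chunk=1024)
        with open(dst, "rb") as f:
            assert f.read() == b"A" * copied
        print("copied %d bytes" % copied)
```

Note that mapping a 600 MB file in one piece needs enough address space, which may be a problem on a 32-bit system; mapping it in windows would be the usual workaround.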
Anyway, Bengt Richter correctly pointed out that one needs
a better specification to come up with a reasonable
estimation.
I just don't happen to see the advantages of bringing
a database into the picture. It seems like a classical
batch job and 'random access many times' is not needed,
so why?
holger
-- snip
#!/usr/bin/env python2.2
import os, sys, time

kb = 1024
fn1 = '/net/projects/bigtestfile'
fn2 = '/net/projects/bigtestfile.out'

def filesize(fn):
    # element 6 of the stat tuple is the size in bytes
    return os.stat(fn)[6]

def filekbsize(fn):
    return filesize(fn)/kb

def create(kbnum):
    # write kbnum blocks of 1 kb each to fn1
    f = open(fn1, 'wb+')
    print "writing file with size %d kb" % kbnum
    for i in xrange(kbnum):
        f.write('a'*kb)
    f.close()

def modify():
    # copy fn1 to fn2 in 1 kb chunks, upper-casing each chunk
    f1 = open(fn1, 'rb+')
    f2 = open(fn2, 'wb+')
    f1size = filesize(fn1)
    while 1:
        kbbuf = f1.read(kb)
        if len(kbbuf) == 0:
            break
        kbbuf = kbbuf.upper()
        f2.write(kbbuf)
    f2.close()
    f1.close()

if __name__ == '__main__':
    # file size in kb from the command line, default 1000
    size = len(sys.argv) > 1 and int(sys.argv[1]) or 1000
    start = time.time()
    create(size)
    print 'time to write input file[%d kb]: %.1f seconds' % (filekbsize(fn1), time.time() - start)
    start = time.time()
    modify()
    print 'time to read input file, upper(), write to new file: %.1f seconds' % (time.time() - start)
    print 'file sizes (input/output): %dkb /%dkb' % \
          (filekbsize(fn1), filekbsize(fn2))
-- snip