Newbie completely confused

Jeroen Hegeman jeroen.hegeman at gmail.com
Fri Sep 21 12:34:40 EDT 2007


Dear Pythoneers,

I'm moderately new to Python and it has already got me completely lost.

I've got a bunch of large (30MB) txt files containing one 'event' per
line. I open the files one after another, read them line by line, and
from each line build a 'data structure': an instance of a main class
(HugeClass) containing some simple information as well as several
instances of some other classes.

No problem so far, but I noticed that the first file was always  
faster than the others, whereas I would expect it to be slower, if  
anything. Testing with two copies of the same file shows the same  
behaviour.

Below is a (rather large, I'll explain) chunk of code. I ran this in  
a directory with two test files called 'test_file0.txt' and  
'test_file1.txt', each containing 10k lines of the same information  
as the 'long_line' variable in the code. This shows the following  
timing (consistently) for the little piece of code that reads all  
lines from file:

...processing all 2 files found
--> 1/2: ./test_file0.txt
Now reading ...
DEBUG readLines A took 0.093 s
...took 8.85717201233 seconds
--> 2/2: ./test_file0.txt
Now reading ...
DEBUG readLines A took 3.917 s
...took 12.8725550175 seconds

So the first time around the file gets read in ~0.1 seconds, while the
second time around it needs almost four seconds! As far as I can see
this is related to 'something in memory being copied around': if I
replace 'alternative 1' with 'alternative 2' (basically making sure
that my classes are not used), the reading time for the second pass
drops back to normal (roughly what it is for the first pass).
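
One way to test the memory hypothesis would be to time the same
allocation-heavy work over two passes, once with Python's cyclic
garbage collector running and once with it switched off. The sketch
below is only that: make_objects() is a hypothetical stand-in for
building the HugeClass instances, time_passes() is a helper name
invented here, and whether the collector has anything to do with the
slowdown is an assumption, not something the timings above establish.

```python
import gc
import time

def make_objects(n):
    # Hypothetical stand-in for building HugeClass instances:
    # many small containers that hold references to other containers.
    store = {}
    for i in range(n):
        store[i] = {"values": [float(i), float(i + 1)]}
    return store

def time_passes(work, passes=2, disable_gc=False):
    """Call work() `passes` times and return the elapsed seconds per pass."""
    was_enabled = gc.isenabled()
    if disable_gc:
        gc.disable()
    timings = []
    try:
        for _ in range(passes):
            start = time.time()
            work()
            timings.append(time.time() - start)
    finally:
        # Put the collector back into the state we found it in.
        if was_enabled:
            gc.enable()
        else:
            gc.disable()
    return timings

if __name__ == "__main__":
    job = lambda: make_objects(10000)
    print("gc on : %s" % time_passes(job))
    print("gc off: %s" % time_passes(job, disable_gc=True))
```

If the later pass is only slow while the collector is enabled, that
would point at garbage collection rather than at the file reading
itself.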

Let me apologise in advance for the size of the code chunk below. I
know about 'minimal reproducible examples' and such, but I found that
if I comment out the filling (and thus binding) of some of the member
variables in the lower-level classes, the problem (sometimes) also
disappears. Doesn't that also point to some magic happening in memory?

I probably mucked something up but I'm really lost as to where. Any  
help would be appreciated.

The original problem showed up using Python 2.4.3 under Linux (Fedora
Core 1). Python 2.3.5 on OS X 10.4.10 (PPC) does not appear to show
this issue(?).
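
For anyone wanting to reproduce this, the two test files can be
generated along the following lines. This is a sketch:
write_test_files() is a helper name made up here, and it assumes each
file is nothing more than 10k copies of the 'long_line' string from the
code chunk, one per line.

```python
import os

def write_test_files(line, directory=".", n_files=2, n_lines=10000):
    """Write test_file0.txt, test_file1.txt, ... each holding n_lines copies of line."""
    paths = []
    for k in range(n_files):
        path = os.path.join(directory, "test_file%d.txt" % k)
        out = open(path, "w")
        try:
            for _ in range(n_lines):
                out.write(line + "\n")
        finally:
            out.close()
        paths.append(path)
    return paths
```

Calling write_test_files(long_line) in the directory that holds the
script should then give it the input it expects.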

Thanks,
Jeroen

P.S. Any ideas on optimising the input to the classes would be  
welcome too ;-)
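
To make the question concrete, one direction this could take is a
table-driven reader: list the attribute names and converters once, and
let a small helper walk the tokens instead of repeating 'c += 1' on
every line. This is only a sketch; read_fields(), SMALL_FIELDS and
SmallRecord are names invented for illustration, the layout copies the
seven fields of SmallClass below, and it mainly shortens the code.
Whether it is any faster would need measuring.

```python
# Hypothetical field layout: (attribute name, converter) in column order,
# mirroring the seven fields read by SmallClass.input().
SMALL_FIELDS = [
    ("item0", int),
    ("item1", float),
    ("item2", int),
    ("item3", float),
    ("item4", float),
    ("item5", float),
    ("item6", float),
]

def read_fields(obj, fields, tokens, c):
    """Convert len(fields) tokens starting at index c, store them on obj,
    and return the index of the next unread token."""
    for offset, (name, convert) in enumerate(fields):
        setattr(obj, name, convert(tokens[c + offset]))
    return c + len(fields)

class SmallRecord(object):
    def input(self, tokens, c):
        return read_fields(self, SMALL_FIELDS, tokens, c)

if __name__ == "__main__":
    tokens = "21,0.0000,-1,-0.5,2.2,-43.2,43.3".split(",")
    record = SmallRecord()
    print(record.input(tokens, 0))
```

The same helper could serve the larger classes by giving each its own
field list.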

Jeroen Hegeman
jeroen DOT hegeman AT gmail DOT com



===================Start of code chunk=========================
#!/usr/bin/env python

import time
import sys
import os
import gzip
import pdb

# A single event line, written as adjacent string literals that Python
# joins into one string.
long_line = (
    "1,31905,0,174501,46152419,2117961,143,-1.0000,51,2,-19.9139,42,-19.9140,"
    "6.6002,0,0,0,46713.1484,2,0.0000,-1,1.4203220606,0.3876158297,147.121017"
    "4561,147.1284120973,-2,0.0000,-1,1.5887237787,-2.4011900425,-319.7776794"
    "434,319.7906836817,4,21,0.0000,-1,-0.5672637224,2.2052443027,-43.2842369"
    "080,43.3440905719,21,0.0000,-1,-0.8540721536,0.0770076364,-22.7033920288,"
    "22.7195827425,21,0.0000,-1,0.1623233557,0.5845987201,-28.0794525146,28.0"
    "860084170,21,0.0000,-1,0.1943928897,-0.2195242196,-22.0666370392,22.0685"
    "899391,6,0.0000,-1,-40.1810989380,-127.0743789673,-104.9231948853,239.74"
    "36794163,-6,0.0000,-1,43.2013626099,125.0640945435,-67.7339172363,227.17"
    "53587387,24,0.0000,-1,-57.9123306274,-17.3483123779,-71.8334121704,123.4"
    "397648033,-24,0.0000,-1,84.0985488892,54.4542312622,-62.4525032043,144.5"
    "299239704,5,0.0000,-1,17.7312316895,-109.7260665894,-33.0897827148,116.3"
    "039146130,-5,0.0000,-1,-40.8971862793,70.6098632812,-5.2814140320,82.645"
    "4347683,4,0.0000,-1,-6.2859884724,-17.9586020410,-58.9464384913,69.40294"
    "68585,-3,0.0000,-1,-51.6263811588,0.6104701459,-12.8869901896,54.0368221"
    "571,3,0.0000,-1,16.4690684490,48.0271777511,-51.7867884636,74.5327484701"
    ",-4,0.0000,-1,67.6295298338,6.4269350171,-10.6658525467,69.9971834876,7,"
    "7,1.0345464706e+01,-7.0800781250e+01,-2.0385742187e+01,7.5256346272e"
    "+01,1.3148,0.0072,0.0072,1.3148,0.0072,0.0072,1.0255,1.0413,0.0,0.0,0.0,"
    "0.0,-1.0,-4.2383,49.5276,13,0.1537,0.5156,0,0.9982,0.0034,1.0000,7,1,0.9"
    "566,0.0062,1,0,2,1.2736,1,7.8407,1,0,2,1.2736,1,7.8407,0,0,-1.0,-1.0,5,1"
    ",-2.4047853470e+01,4.0832519531e+01,-3.8452150822e+00,4.7851562559e"
    "+01,1.3383,0.0051,0.0051,1.3383,0.0051,0.0051,0.9340,0.9541,0.0,0.0,0.0,"
    "0.0,-1.0,-2.4609,21.3916,7,0.1166,0.5977,0,0.9999,0.0052,1.0000,9,1,0.99"
    "47,0.0063,1,0,2,0.7735,1,74.7937,1,0,2,0.7735,1,74.7937,0,0,-1.0,-1.0,5,"
    "1,-4.4067382812e+01,2.5634796619e+00,-1.1138916016e+01,4.6203614579e"
    "+01,1.3533,0.0054,0.0054,1.3533,0.0054,0.0054,1.0486,1.0903,0.0,0.0,0.0,"
    "0.0,-1.0,-3.9648,31.3733,13,0.1767,0.5508,100,0.9977,0.0040,1.0000,9,1,0."
    "0000,0.4349,0,0,0,0.0000,0,-1000.0000,0,0,0,0.0000,0,-1000.0000,0,0,-1.0"
    ",-1.0,0,1,3.7200927734e+01,2.7465817928e+00,-5.5847163200e"
    "+00,3.7994386563e"
    "+01,1.3634,0.0062,0.0062,1.6488,0.0385,0.0385,0.7141,0.9013,5.3986899118"
    "e+00,6.6766492833e-01,-2.3780213181e-01,5.4460399892e"
    "+00,0.5504,-3.1445,0.7776,9,0.1169,0.7734,0,0.9977,0.0040,1.0000,7,1,0.0"
    "000,0.1099,0,0,0,0.0000,0,-1000.0000,0,0,0,0.0000,0,-1000.0000,1,-1,5.38"
    "93,0.5459,4,1,1.2969970703e+01,3.3203125000e+01,-3.7231445312e"
    "+01,5.2001951876e"
    "+01,1.4414,0.0129,0.0129,1.4414,0.0129,0.0129,0.9019,0.7331,0.0,0.0,0.0,"
    "0.0,-1.0,-10.0195,12.2034,17,0.1922,0.3633,0,0.9774,0.0248,1.0000,6,1,0."
    "0000,0.3523,0,0,0,0.0000,0,-1000.0000,0,0,0,0.0000,0,-1000.0000,0,0,-1.0"
    ",-1.0,0,1,-1.6174327135e+00,-7.1411132812e+00,-1.8798828125e"
    "+01,2.0202637222e"
    "+01,1.7886,0.0352,0.0352,1.7886,0.0352,0.0352,1.8257,1.2368,0.0,0.0,0.0,"
    "0.0,-1.0,-17.3438,45.6714,10,0.1529,0.5625,0,0.9898,0.0094,1.0000,3,1,-1."
    "0000,10000.0000,0,0,0,-1.0000,0,-1.0000,0,0,0,-1.0000,0,-1.0000,0,0,-1.0"
    ",-1.0,-6,0,-5.9204106331e+00,-3.4484868050e+00,-6.5307617187e"
    "+00,9.6740722971e"
    "+00,1.6782,0.0326,0.0326,1.6782,0.0326,0.0326,1.0000,1.0000,0.0,0.0,0.0,"
    "0.0,-1.0,-9.4727,37.3401,13,0.2711,0.2344,100,0.9861,0.0045,1.0000,3,1,-"
    "1.0000,10000.0000,0,0,0,-1.0000,0,-1.0000,0,0,0,-1.0000,0,-1.0000,0,0,-1"
    ".0,-1.0,-6,0"
)

######################################################################## 
###

class SmallClass:
     def __init__(self):
         return
     def input(self, line, c):
         self.item0 = int(line[c]); c += 1
         self.item1 = float(line[c]); c += 1
         self.item2 = int(line[c]); c += 1
         self.item3 = float(line[c]); c += 1
         self.item4 = float(line[c]); c += 1
         self.item5 = float(line[c]); c += 1
         self.item6 = float(line[c]); c += 1
         return c

######################################################################## 
###

class ModerateClass:
     def __init__(self):
         return
     def __del__(self):
         pass
         return
     def input(self, line, c):

         self.items = {}

         self.item0 = float(line[c]);
         c += 1

         unit1 = SmallClass()
         c = unit1.input(line, c)
         self.items[len(self.items)] = unit1
         unit2 = SmallClass()
         c = unit2.input(line, c)
         self.items[len(self.items)] = unit2

         units_chunk = []
         chunk_size = int(line[c])
         c += 1
         for i in xrange(chunk_size):
             unit = SmallClass()
             c = unit.input(line, c)
             units_chunk.append(unit)
         for i in xrange(10):
             unit = SmallClass()
             c = unit.input(line, c)
         return c

######################################################################## 
###

class LongClass:

     def __init__(self):
         return
     def clear(self):
         return
     def input(self, foo, c):
         self.item0 = float(foo[c]); c += 1
         self.item1 = float(foo[c]); c += 1
         self.item2 = float(foo[c]); c += 1
         self.item3 = float(foo[c]); c += 1
         self.item4 = float(foo[c]); c+=1
         self.item5 = float(foo[c]); c+=1
         self.item6 = float(foo[c]); c+=1
         self.item7 = float(foo[c]); c+=1
         self.item8 = float(foo[c]); c+=1
         self.item9 = float(foo[c]); c+=1
         self.item10 = float(foo[c]); c+=1
         self.item11 = float(foo[c]); c+=1
         self.item12 = float(foo[c]); c += 1
         self.item13 = float(foo[c]); c += 1
         self.item14 = float(foo[c]); c += 1
         self.item15 = float(foo[c]); c += 1
         self.item16 = float(foo[c]); c+=1
         self.item17 = float(foo[c]); c+=1
         self.item18 = float(foo[c]); c+=1
         self.item19 = int(foo[c]); c+=1
         self.item20 = float(foo[c]); c+=1
         self.item21 = float(foo[c]); c+=1
         self.item22 = int(foo[c]); c+=1
         self.item23 = float(foo[c]); c += 1
         self.item24 = float(foo[c]); c += 1
         self.item25 = float(foo[c]); c+=1
         self.item26 = int(foo[c]); c+=1
         self.item27 = bool(int(foo[c])); c+=1
         self.item28 = float(foo[c]); c+=1
         self.item29 = float(foo[c]); c+=1
         self.item30 = (foo[c] == "1"); c += 1
         self.item31 = (foo[c] == "1"); c += 1
         self.item32 = float(foo[c]); c += 1
         self.item33 = float(foo[c]); c += 1
         self.item34 = int(foo[c]); c += 1
         self.item35 = float(foo[c]); c += 1
         self.item36 = (foo[c] == "1"); c+=1
         self.item37 = (foo[c] == "1"); c+=1
         self.item38 = float(foo[c]); c += 1
         self.item39 = float(foo[c]); c += 1
         self.item40 = int(foo[c]); c += 1
         self.item41 = float(foo[c]); c += 1
         self.item42 = (foo[c] == "1"); c+=1
         self.item43 = float(foo[c]); c+=1
         self.item44 = float(foo[c]); c+=1
         self.item45 = float(foo[c]); c += 1
         self.item46 = int(foo[c]); c+=1
         self.item47 = bool(int(foo[c])); c+=1
         return c

######################################################################## 
###

class HugeClass:
     def __init__(self,line):
         self.clear()
         self.input(line)
         return
     def __del__(self):
         del self.B4v
         return
     def clear(self):
         self.long_classes = {}
         self.B4v={}
         return
     def input(self, line):

         try:
             foo = line.strip().split(',')
             c = 0

             self.asciiVersion = float(foo[c])
             c += 1

             self.item0 = foo[c]; c += 1
             self.item1 = (self.item0 != "0")

             self.item2 = (foo[c] == "1"); c += 1

             self.item3=int(foo[c]); c+=1
             self.item4=int(foo[c]); c+=1
             self.item5=int(foo[c]); c+=1
             self.item6=int(foo[c]); c += 1
             self.item7=float(foo[c]); c+=1

             self.item8 = foo[c]; c += 1
             bit_item = int(self.item8)
             self.item9 = bool(bit_item & 2048)
             self.item10 = bool(bit_item & 1024)
             self.item11 = bool(bit_item & 512)
             self.item12 = bool(bit_item & 256)
             self.item13 = bool(bit_item & 128)
             self.item14 = bool(bit_item & 64)
             self.item15 = bool(bit_item & 32)
             self.item16 = bool(bit_item & 16)
             self.item17 = bool(bit_item & 8)
             self.item18 = bool(bit_item & 4)
             self.item19 = bool(bit_item & 2)
             self.item20 = bool(bit_item & 1)

             self.item21 = int(foo[c]); c+=1
             self.item22 = float(foo[c]); c+=1
             self.item23 = int(foo[c]); c+=1
             self.item24 = float(foo[c]); c+=1

             self.item25 = float(foo[c]); c+=1

             self.item26 = foo[c]; c+=1
             self.item27 = int(foo[c]); c+=1
             self.item28 = int(foo[c]); c+=1

             self.item29 = ModerateClass()
             c = self.item29.input(foo, c)

             self.item30 = int(foo[c]); c+=1
             self.item31 = int(foo[c]); c+=1

             for i in xrange(self.item31):
                 unit = LongClass()
                 c = unit.input(foo, c)
                 self.long_classes[len(self.long_classes)] = unit

             assert c == len(foo), \
                    "ERROR We did not read the whole line!!!"

         except (ValueError,IndexError), msg:
             print >> sys.stderr, \
                   "ERROR Trouble reading line: `%(msg)s'" % vars()
             self.clear()
             return
         return

######################################################################## 
###

def readLines(f):
     DATA = []
     f.seek(0)

     time_a = time.time()
     for i in f:
         DATA.append(i)
     time_b = time.time()

     time_spent_reading = time_b - time_a
     print "DEBUG readLines took %.3f s" % time_spent_reading

     return DATA

######################################################################## 
###

def ReadClasses(filename):

     print 'Now reading ...'

     built_classes = {}

     # Read lines from file
     in_file = open(filename, 'r')
     LINES = readLines(in_file)
     in_file.close()

     # and interpret them.
     for i in LINES:
## This is alternative 1.
         built_classes[len(built_classes)] = HugeClass(long_line)
## The next line is alternative 2.
##        built_classes[len(built_classes)] = long_line

     del LINES

     return

######################################################################## 
###

def ProcessList():

     input_files = ["./test_file0.txt",
                    "./test_file0.txt"]

     # Loop over all files that we found.
     nfiles = len(input_files)
     file_index = 0
     for i in input_files:
         print "--> %i/%i: %s" % (file_index+1, nfiles, i)
         ReadClasses(i)
         file_index += 1

     return

######################################################################## 
###

if __name__ == "__main__":
     ProcessList()

     sys.exit(0)

######################################################################## 
###





