Newbie completely confused

Roel Schroeven rschroev_nospam_ml at fastmail.fm
Sat Sep 22 04:47:36 EDT 2007


Jeroen Hegeman schreef:
> ...processing all 2 files found
> --> 1/2: ./test_file0.txt
> Now reading ...
> DEBUG readLines A took 0.093 s
> ...took 8.85717201233 seconds
> --> 2/2: ./test_file0.txt
> Now reading ...
> DEBUG readLines A took 3.917 s
> ...took 12.8725550175 seconds
> 
> So the first time around the file gets read in in ~0.1 seconds, the  
> second time around it needs almost four seconds! As far as I can see  
> this is related to 'something in memory being copied around' since if  
> I replace the 'alternative 1' by the 'alternative 2', basically  
> making sure that my classes are not used, reading time the second  
> time around drops back to normal (= roughly what it is the first pass).

(First, I had to add timing code to ReadClasses: the code you posted 
doesn't include them, and only shows timings for ReadLines.)

Your program uses quite a bit of memory. I guess it gets harder and 
harder to allocate the required amounts of memory.

If I change this line in ReadClasses:

         built_classes[len(built_classes)] = HugeClass(long_line)

to

	dummy = HugeClass(long_line)

then both times the files are read and your data structures are built, 
but after each run the data structure is freed. The result is that both 
runs are equally fast.

Also, if I run the first version (without the dummy) on a computer with 
a bit more memory (1 GiB), it seems there is no problem allocating 
memory: both runs are equally fast.

I'm not sure how to speed things up here... you're doing much processing 
on a lot of small chunks of data. I have a number of observations and 
possible improvements though, and some might even speed things up a bit.

You read the files, but don't use the contents; instead you use 
long_line over and over. I suppose you do that because this is a test, 
not your actual code?

__init__() with nothing (or only return) in it is not useful; better to 
just leave it out.


You have a number of return statements that don't do anything (i.e. they 
return nothing (None actually) at the end of the function). A function 
without return automatically returns None at the end, so it's better to 
leave them out.

Similarly you don't need to call sys.exit(): the script will terminate 
anyway if it reaches the end. Better leave it out.


LongClass.clear() doesn't do anything and isn't called anyway; leave it out.


ModerateClass.__del__() doesn't do anything either. I'm not sure how it 
affects what happens if ModerateClass gets freed, but I suggest you 
don't start messing with __del__() until you have more Python knowledge 
and experience. I'm not sure why you think you need to implement that 
method.
The same goes for HugeClass.__del__(). It does delete self.B4v, but the 
default behavior will do that too. Again, I don't get why you want to 
override the default behavior.


In a number of cases, you use a dict like this:

     built_classes  = {}
     for i in LINES:
         built_classes[len(built_classes)] = ...

So you're using the indices 0, 1, 2, ... as the keys. That's not what 
dictionaries are made for; lists are much better for that:

     built_classes = []
     for i  in LINES:
         built_classes.append(...)


HugeClass.B4v isn't used, so you can safely remove it.


Your readLines() function reads a whole file into memory. If you're 
working with large files, that's not such a good idea. It's better to 
load one line at a time into memory and work on that. I would even 
completely remove readLines() and restructure ReadClasses() like this:

def ReadClasses(filename):
      print 'Now reading ...'

      built_classes = []

      # Open file
      in_file = open(filename, 'r')

      # Read lines and interpret them.
      time_a = time.time()
      for i in in_file:
## This is alternative 1.
          built_classes.append(HugeClass(long_line))
## The next line is alternative 2.
##        built_classes[len(built_classes)] = long_line

      in_file.close()
      time_b = time.time()
      print "DEBUG readClasses took %.3f s" % (time_b - time_a)

Personally I only use 'i' for integer indices (as in 'for i in 
range(10)'); for other use I prefer more descriptive names:

     for line in in_file: ...

But I guess that's up to personal preference. Also you used LINES to 
store the file contents; the convention is that names with all capitals 
are used for constants, not for things that change.


In ProcessList(), you keep the index in a separate variable. Python has 
a trick so you don't have to do that yourself:

     nfiles = len(input_files)
     for file_index, i in enumerate(input_files):
         print "--> %i/%i: %s" % (file_index + 1, nfiles, i)
         ReadClasses(i)


Instead of item0, item1, ... , it's generally better to use a list, so 
you can use item[0], item[1], ...


And finally, John Machin's suggestion looks like a good way to 
restructure that long sequence of conversions and assignments in HugeClass.


-- 
The saddest aspect of life right now is that science gathers knowledge
faster than society gathers wisdom.
   -- Isaac Asimov

Roel Schroeven



More information about the Python-list mailing list