Newbie completely confused

Jeroen Hegeman jeroen.hegeman at gmail.com
Mon Sep 24 11:50:57 EDT 2007


Thanks for the comments,

>
> (First, I had to add timing code to ReadClasses: the code you posted
> doesn't include them, and only shows timings for ReadLines.)
>
> Your program uses quite a bit of memory. I guess it gets harder and
> harder to allocate the required amounts of memory.

Well, I guess there could be something in that, but why is there a
significant increase after the first time? And after that,
single-trip time pretty much flattens out. No more obvious increases.

>
> If I change this line in ReadClasses:
>
>          built_classes[len(built_classes)] = HugeClass(long_line)
>
> to
>
>          dummy = HugeClass(long_line)
>
> then both times the files are read and your data structures are built,
> but after each run the data structure is freed. The result is that
> both runs are equally fast.

Isn't the 'del LINES' supposed to achieve the same thing? And
really, reading 30MB files should not be such a problem, right? (I'm
also running with 1GB of RAM.)
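
To make the question concrete, here is a minimal sketch (not the
original test program) of what 'del LINES' does and does not release.
It assumes CPython's reference counting; the HugeClass below is a tiny
stand-in, and build_and_keep and probe are names made up for this
illustration.

```python
import weakref


class HugeClass:
    """Tiny stand-in for the test program's HugeClass (assumption)."""

    def __init__(self, line):
        self.line = line


def build_and_keep(lines):
    # Every object built here stays referenced by the returned dict.
    built_classes = {}
    for line in lines:
        built_classes[len(built_classes)] = HugeClass(line)
    return built_classes


LINES = ["a b c", "d e f"]
built = build_and_keep(LINES)
probe = weakref.ref(built[0])  # watches one object without keeping it alive

del LINES  # unbinds the list of lines only
alive_after_del_lines = probe() is not None

built.clear()  # drops the last references to the built objects
alive_after_clear = probe() is not None
```

In this sketch the built object survives 'del LINES' and only goes
away once the container that holds it is emptied, which is the
difference between storing into built_classes and binding to a
throwaway 'dummy'.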

> I'm not sure how to speed things up here... you're doing much
> processing on a lot of small chunks of data. I have a number of
> observations and possible improvements though, and some might even
> speed things up a bit.

Cool thanks, let's go over them.

>
> You read the files, but don't use the contents; instead you use
> long_line over and over. I suppose you do that because this is a test,
> not your actual code?

Yeah ;-) (Do I notice a lack of trust in the responses I get? Should  
I not mention 'newbie'?)

Let's get a couple of things out of the way:
- I do know about meaningful variable names and case-conventions,  
but ... First of all I also have to live with inherited code (I don't  
like people shouting in their code either), and secondly (all the  
itemx) most of these members normally _have_ descriptive names but  
I'm not supposed to copy-paste the original code to any newsgroups.
- I also know that a plain 'return' in python does not do anything  
but I happen to like them. Same holds for the sys.exit() call.
- The __init__ methods normally actually do something: they  
initialise some member variables to meaningful values (by calling the  
clear() method, actually).
- The __clear__ method normally brings objects back into a
well-defined 'empty' state.
- The __del__ methods are actually needed in this case (well, in the  
_real_ code anyway). The python code loads a module written in C++  
and some of the member variables actually point to C++ objects  
created dynamically, so one actually has to call their destructors  
before unbinding the python var.
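
The __init__/clear/__del__ arrangement described in the last three
points could look roughly like the sketch below. Everything in it is
hypothetical: Wrapper, release_native and the string handle merely
stand in for the real classes and for the C++ objects owned by the
extension module, which are not shown in this thread.

```python
released = []  # records which hypothetical native handles were freed


def release_native(handle):
    # Hypothetical stand-in for invoking a C++ destructor via the module.
    released.append(handle)


class Wrapper:
    def __init__(self, handle):
        self.clear()          # start from the well-defined 'empty' state
        self.handle = handle  # then take ownership of the native object

    def clear(self):
        # Free the native object, if any, and return to the 'empty' state.
        # getattr() keeps this safe even if __init__ never set the member.
        handle = getattr(self, "handle", None)
        if handle is not None:
            release_native(handle)
        self.handle = None

    def __del__(self):
        # Last-resort cleanup when the Python object itself goes away.
        self.clear()


w = Wrapper("obj-1")
w.clear()
w.clear()  # second call is a no-op: the handle was already released
```

Making clear() idempotent means an explicit clear() followed later by
__del__ does not release the same native object twice.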

I tried to get things down to as small as possible, but then I found
out that the size of the classes seems to contribute to the issue:
remove enough member variables and all of a sudden the speed increases
by a factor of ten, so there seems to be some breakpoint depending on
the size of the classes. So I could not simply remove all members, but
had to give them funky names instead. I kept the main structure of
things, though, to see if that would solicit comments. (And it did...)

>
>
> In a number of cases, you use a dict like this:
>
>      built_classes = {}
>      for i in LINES:
>          built_classes[len(built_classes)] = ...
>
> So you're using the indices 0, 1, 2, ... as the keys. That's not what
> dictionaries are made for; lists are much better for that:
>
>      built_classes = []
>      for i in LINES:
>          built_classes.append(...)

Yeah, I inherited that part...

>
> Your readLines() function reads a whole file into memory. If you're
> working with large files, that's not such a good idea. It's better to
> load one line at a time into memory and work on that. I would even
> completely remove readLines() and restructure ReadClasses() like this:

Actually, part of what I removed was the real reason why readLines()  
is there at all: it reads files in blocks of (at most) some_number  
lines, and keeps track of the line offset in the file. I kept this  
structure hoping that someone would point out something obvious like  
some internal buffer going out of scope or whatever.
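
For illustration, block-wise reading with a running line offset could
be sketched as follows. This is not the removed code: read_blocks and
max_lines are names invented here, and the "offset" is assumed to be a
count of lines already read rather than a byte position.

```python
import io
from itertools import islice


def read_blocks(stream, max_lines):
    """Yield (offset, lines) pairs, each holding at most max_lines lines."""
    offset = 0  # number of lines consumed before the current block
    while True:
        block = list(islice(stream, max_lines))
        if not block:
            return
        yield offset, block
        offset += len(block)


# An in-memory file keeps the example self-contained.
sample = io.StringIO("one\ntwo\nthree\nfour\nfive\n")
blocks = list(read_blocks(sample, 2))
```

Only one block of lines is held in memory at a time (apart from the
final list() call, which is there just to inspect the result).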

All right, thanks for the tips. I guess the issue itself is still  
open, though.

Cheers,
Jeroen

Jeroen Hegeman
jeroen DOT hegeman AT gmail DOT com

WARNING: This message may contain classified information. Immediately  
burn this message after reading.





