Newbie completely confused

Tue Sep 25 06:03:04 EDT 2007

Jeroen Hegeman schreef:
> Thanks for the comments,
> 
>> (First, I had to add timing code to ReadClasses: the code you posted
>> doesn't include them, and only shows timings for ReadLines.)
>>
>> Your program uses quite a bit of memory. I guess it gets harder and
>> harder to allocate the required amounts of memory.
> 
> Well, I guess there could be something in that, but why is there a
> significant increase after the first time? And after that, single-trip
> time pretty much flattens out. No more obvious increases.

Sorry, I have no idea.

>> If I change this line in ReadClasses:
>>
>>          built_classes[len(built_classes)] = HugeClass(long_line)
>>
>> to
>>
>> 	dummy = HugeClass(long_line)
>>
>> then both times the files are read and your data structures are built,
>> but after each run the data structure is freed. The result is that both
>> runs are equally fast.
> 
> Isn't the 'del LINES' supposed to achieve the same thing? And
> really, reading 30MB files should not be such a problem, right? (I'm  
> also running with 1GB of RAM.)

'del LINES' deletes the lines that are read from the file, but not all 
of your data structures that you created out of them.
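
To illustrate what I mean, here is a minimal sketch. HugeClass below is a made-up stand-in, not your real class, and the weakref probe is only there to show when an object actually goes away. It assumes CPython, where an object is freed as soon as its last reference disappears; other implementations may free it later.

```python
import weakref

class HugeClass(object):
    """Hypothetical stand-in for the class in the test program."""
    def __init__(self, line):
        self.fields = line.split()

def demo():
    lines = ["item0 item1 item2"] * 1000   # plays the role of LINES
    built_classes = {}
    for line in lines:
        built_classes[len(built_classes)] = HugeClass(line)

    # A weak reference lets us observe an object without keeping it alive.
    probe = weakref.ref(built_classes[0])

    del lines                       # drops only the list of input lines
    alive_after_del = probe() is not None

    built_classes.clear()           # drops the objects built from the lines
    alive_after_clear = probe() is not None
    return alive_after_del, alive_after_clear

print(demo())   # on CPython: (True, False)
```

So the built objects survive 'del lines' and only go away once the dictionary that holds them is emptied (or deleted itself).
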
Now, indeed, reading 30 MB files should not be a problem. And I am 
confident that just reading the data is not a problem. To make sure, I 
created a simple test:

import time

input_files = ["./test_file0.txt", "./test_file1.txt"]

total_start = time.time()
data = {}
for input_fn in input_files:
    file_start = time.time()
    f = open(input_fn, 'r')
    data[input_fn] = f.read()
    f.close()
    file_done = time.time()
    # Report the size of this file's contents, not the number of dict entries.
    print '%s: %f to read %d bytes' % (input_fn, file_done - file_start,
                                       len(data[input_fn]))
total_done = time.time()
print 'all done in %f' % (total_done - total_start)

When I run that with test_file0.txt and test_file1.txt as you described 
(each 30 MB), I get the output below. (The byte counts read 1 and 2 
because this run printed len(data), the number of dictionary entries, 
rather than len(data[input_fn]); the timings are not affected.)

./test_file0.txt: 0.260000 to read 1 bytes
./test_file1.txt: 0.251000 to read 2 bytes
all done in 0.521000

Therefore I think the problem is not in reading the data, but in 
processing it and creating the data structures.
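
One way to check that would be to time the two phases separately. What follows is only a sketch under assumptions: HugeClass, read_lines, build_classes and timed are names I made up for illustration, and the throwaway input file is far smaller than your 30 MB files, so the numbers it prints say nothing about your real program.

```python
import os
import tempfile
import time

class HugeClass(object):
    """Hypothetical stand-in; the real class is not shown in this thread."""
    def __init__(self, line):
        self.fields = line.split()

def timed(func, *args):
    """Call func(*args) and return (result, elapsed seconds)."""
    start = time.time()
    result = func(*args)
    return result, time.time() - start

def read_lines(path):
    """Phase 1: only read the file into a list of lines."""
    with open(path, "r") as f:
        return f.readlines()

def build_classes(lines):
    """Phase 2: only build the data structures from the lines."""
    built = {}
    for line in lines:
        built[len(built)] = HugeClass(line)
    return built

# Small throwaway input so the sketch runs anywhere.
fd, path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "w") as f:
    f.write("item0 item1 item2\n" * 10000)

lines, read_time = timed(read_lines, path)
built, build_time = timed(build_classes, lines)
os.remove(path)

print("read: %f s, build: %f s" % (read_time, build_time))
```

Running both phases twice in a row, as your test does, and comparing the four timings should show which phase slows down on the second pass.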

>> You read the files, but don't use the contents; instead you use
>> long_line over and over. I suppose you do that because this is a test,
>> not your actual code?
> 
> Yeah ;-) (Do I notice a lack of trust in the responses I get? Should  
> I not mention 'newbie'?)

I didn't mean to attack you; it's just that the program reads 30 MB of 
data, twice, but doesn't do anything with it. It only uses the data that 
was stored in long_line, which is never replaced. That is very 
strange for real code, but as a test it can have its uses. That's why I 
asked.

> Let's get a couple of things out of the way:
> - I do know about meaningful variable names and case-conventions,  
> but ... First of all I also have to live with inherited code (I don't  
> like people shouting in their code either), and secondly (all the  
> itemx) most of these members normally _have_ descriptive names but  
> I'm not supposed to copy-paste the original code to any newsgroups.

Ok.

> - I also know that a plain 'return' in python does not do anything  
> but I happen to like them. Same holds for the sys.exit() call.

Ok.

> - The __init__ methods normally actually do something: they  
> initialise some member variables to meaningful values (by calling the  
> clear() method, actually).
> - The __clear__ method normally brings objects back into a
> well-defined 'empty' state.
> - The __del__ methods are actually needed in this case (well, in the  
> _real_ code anyway). The python code loads a module written in C++  
> and some of the member variables actually point to C++ objects  
> created dynamically, so one actually has to call their destructors  
> before unbinding the python var.

That sounds a bit weird to me; I would think such explicit memory 
management belongs in the C++ code instead of in the Python code, but I 
must admit that I know next to nothing about extending Python so I 
assume you are right.

> All right, thanks for the tips. I guess the issue itself is still  
> open, though.

I'm afraid so. Sorry I can't help.

One thing that helped me in the past to speed up input is using 
memory-mapped I/O instead of stream I/O. But that was in C++ on Windows; 
I don't know if the same applies to Python on Linux.
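
For what it's worth, Python's standard library does have an mmap module. Below is a minimal sketch of how a file could be scanned through a memory map. The function name count_lines_mmap and the small temporary file are made up for illustration, and I have not measured whether this is any faster than a plain read() for your data.

```python
import mmap
import os
import tempfile

def count_lines_mmap(path):
    """Count newline bytes in a file through a read-only memory map."""
    with open(path, "rb") as f:
        # Length 0 maps the whole file; this fails for an empty file.
        mapped = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
        try:
            count = 0
            pos = mapped.find(b"\n")
            while pos != -1:
                count += 1
                pos = mapped.find(b"\n", pos + 1)
            return count
        finally:
            mapped.close()

# Small throwaway input so the sketch runs anywhere.
fd, path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "wb") as f:
    f.write(b"item0 item1 item2\n" * 1000)

line_count = count_lines_mmap(path)
os.remove(path)
print(line_count)
```

The map is opened read-only and closed explicitly, so the file can be removed afterwards on Windows as well.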

-- 
The saddest aspect of life right now is that science gathers knowledge
faster than society gathers wisdom.
   -- Isaac Asimov

Roel Schroeven