regex over files

Robin Becker robin at reportlab.com
Thu Apr 28 04:07:02 EDT 2005


Skip Montanaro wrote:
......
> 
> I'm not sure why the mmap() solution is so much slower for you.  Perhaps on
> some systems files opened for reading are mmap'd under the covers.  I'm sure
> it's highly platform-dependent.  (My results on MacOSX - see below - are
> somewhat better.)
> 

I'll have a go at repeating the experiment on some other platforms I have 
available. The problem is certainly paging related. Perhaps the fact 
that we don't need to write dirty pages is moot when the system is 
actually writing out other processes' pages to make room for the 
incoming ones needed by the CPU hog. I do know that I cannot control 
that in detail. It's also entirely possible that file caching, readahead 
and the like skew the results.

All my old compiler texts recommend the buffered read approach, but that 
might be because mmap and friends weren't around. Perhaps some compiler 
expert can say? Also, I suspect that in a low-level language the minor 
overhead of the buffer bookkeeping is lower than that of the paging code.
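
To make the bookkeeping concrete, here is a minimal sketch of a buffered 
regex scan in Python 3. It is not the tscan script from this thread nor 
Bengt's code; the function name, the chunk_size default and the max_match 
bound are all assumptions made for illustration. It assumes no match is 
longer than max_match bytes and that the pattern has no lookahead or end 
anchors that depend on data not yet read.

```python
import re


def count_matches_buffered(path, pattern, chunk_size=1 << 20, max_match=4096):
    """Count non-overlapping matches of a bytes pattern, reading in chunks.

    Assumption (hypothetical bound): no match is longer than max_match bytes.
    """
    regex = re.compile(pattern)
    count = 0
    buf = b""
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            at_eof = not chunk
            buf += chunk
            # A match starting before 'limit' must end inside the data we
            # already hold, so later reads cannot change it.
            limit = len(buf) if at_eof else max(0, len(buf) - max_match)
            pos = 0
            while True:
                m = regex.search(buf, pos)
                if m is None or m.start() >= limit:
                    break
                count += 1
                # Step past empty matches so the loop always advances.
                pos = m.end() if m.end() > m.start() else m.end() + 1
            if at_eof:
                return count
            # Carry the undecided tail over to the next read.
            buf = buf[max(pos, limit):]
```

The carried-over tail is the corner case: without it, a match that 
straddles two reads is either missed or counted in pieces.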

> Let me return to your original problem though, doing regex operations on
> files.  I modified your two scripts slightly:
> 
......
> I took the file from Bengt Richter's example and replicated it a bunch of
> times to get a 122MB file.  I then ran the above two programs against it:
> 
>     % python tscan1.py splitX
>     n=2112001 time=8.88
>     % python tscan0.py splitX
>     n=2139845 time=10.26
> 
> So the mmap'd version is within 15% of the performance of the buffered read
> version and we don't have to solve the problem of any corner cases (note the
> different values of n).  I'm happy to take the extra runtime in exchange for
> simpler code.
> 
> Skip

I will have a go at repeating this on my system, perhaps with Bengt's 
code in the buffered case, as that would be more realistic.
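
For reference, the mmap'd style of scan can be sketched roughly as 
follows in Python 3. This is not Skip's script, which is elided above; 
the function name is hypothetical. The whole file is handed to the regex 
engine as one buffer, so no match can be split across a read boundary. 
Note that mmap refuses to map an empty file, which this sketch does not 
handle.

```python
import mmap
import re


def count_matches_mmap(path, pattern):
    """Count non-overlapping matches of a bytes pattern over an mmap'd file."""
    regex = re.compile(pattern)
    with open(path, "rb") as f:
        # Length 0 maps the whole file; ACCESS_READ keeps the mapping read-only.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            # finditer yields matches lazily, so only the count is kept.
            return sum(1 for _ in regex.finditer(mm))
```

The code is short because the operating system does the buffering; the 
cost is that paging behaviour is outside the program's control.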

It has been my experience that all systems crawl when driven into the 
swapping region and some users of our code seem anxious to run huge 
print jobs.
-- 
Robin Becker


