regex over files
Robin Becker
robin at reportlab.com
Thu Apr 28 04:07:02 EDT 2005
Skip Montanaro wrote:
......
>
> I'm not sure why the mmap() solution is so much slower for you. Perhaps on
> some systems files opened for reading are mmap'd under the covers. I'm sure
> it's highly platform-dependent. (My results on MacOSX - see below - are
> somewhat better.)
>
I'll have a go at doing the experiment on some other platforms I have
available. The problem is certainly paging related. Perhaps the fact
that we don't need to write dirty pages is moot when the system is
actually writing out other processes' pages to make room for the
incoming ones needed by the CPU hog. I do know that I cannot control
that in detail. It's also entirely possible that file caching, readahead
and so on skew the results.
All my old compiler texts recommend the buffered read approach, but that
might be because mmap and the like weren't around. Perhaps some compiler
expert can say? I also suspect that in a low-level language the minor
overhead caused by the bookkeeping is lower than that of the paging code.
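For concreteness, here is a minimal sketch of what I mean by the buffered read approach and its bookkeeping. It is not the code from either of the scripts discussed in this thread; the function name, chunk_size and max_match are made up for illustration. It assumes that no match is longer than max_match bytes, that the pattern never matches the empty string, and that the pattern does not depend on text before the match (no lookbehind or ^ anchors).

```python
import re


def count_matches_buffered(path, pattern, chunk_size=1 << 20, max_match=4096):
    """Count non-overlapping matches of a bytes regex, reading in chunks.

    Illustrative sketch only. Assumes no match exceeds max_match bytes,
    the pattern never matches empty, and it needs no context before a match.
    """
    rx = re.compile(pattern)
    count = 0
    tail = b""  # unprocessed bytes carried over from the previous buffer
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            at_eof = not chunk
            buf = tail + chunk
            # A match starting at or after `limit` could be cut short by the
            # end of the buffer, so it is deferred to the next round.
            limit = len(buf) if at_eof else max(len(buf) - max_match, 0)
            pos = 0  # end of the last match accepted in this buffer
            for m in rx.finditer(buf):
                if m.start() >= limit:
                    break
                count += 1
                pos = m.end()
            if at_eof:
                break
            # Resume after the last accepted match, or at `limit` if nothing
            # was accepted beyond it.
            tail = buf[max(pos, limit):]
    return count
```

The carry-over logic is exactly the corner-case handling that the mmap version gets for free, and it is only correct under the assumptions listed above.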
> Let me return to your original problem though, doing regex operations on
> files. I modified your two scripts slightly:
>
......
> I took the file from Bengt Richter's example and replicated it a bunch of
> times to get a 122MB file. I then ran the above two programs against it:
>
> % python tscan1.py splitX
> n=2112001 time=8.88
> % python tscan0.py splitX
> n=2139845 time=10.26
>
> So the mmap'd version is within 15% of the performance of the buffered read
> version and we don't have to solve the problem of any corner cases (note the
> different values of n). I'm happy to take the extra runtime in exchange for
> simpler code.
>
> Skip
I will have a go at repeating this on my system. Perhaps with Bengt's
code in the buffered case as that would be more realistic.
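For comparison, the mmap side of such a test can be sketched roughly as below. This is my own minimal version in current Python, not tscan0.py or tscan1.py, and the function name is hypothetical. The pattern must be a bytes pattern because the mapped file is a bytes-like object.

```python
import mmap
import re


def count_matches_mmap(path, pattern):
    """Count non-overlapping matches of a bytes regex over an mmap'd file.

    Illustrative sketch only. Mapping an empty file raises ValueError,
    so callers would need to handle that case separately.
    """
    rx = re.compile(pattern)
    with open(path, "rb") as f:
        # Length 0 maps the whole file; ACCESS_READ keeps it read-only.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            count = 0
            for _ in rx.finditer(mm):
                count += 1
            return count
```

There is no chunk boundary to worry about here, at the price of leaving all the paging decisions to the operating system.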
It has been my experience that all systems crawl when driven into the
swapping region, and some users of our code seem anxious to run huge
print jobs.
--
Robin Becker