regex over files

Jeremy Bowers jerf at jerf.org
Tue Apr 26 15:25:57 EDT 2005


On Tue, 26 Apr 2005 19:32:29 +0100, Robin Becker wrote:

> Skip Montanaro wrote:
>>     Robin> So we avoid dirty page writes etc etc. However, I still think I
>>     Robin> could get away with a small window into the file which would be
>>     Robin> more efficient.
>> 
>> It's hard to imagine how sliding a small window onto a file within Python
>> would be more efficient than the operating system's paging system. ;-)
>> 
>> Skip
> well it might be if I only want to scan forward through the file (think lexical 
> analysis). Most lexical analyzers use a buffer and produce a stream of tokens ie 
> a compressed version of the input. There are problems crossing buffers etc, but 
> we never normally need the whole file in memory.

I think you might have a misunderstanding here. mmap puts a file into
*virtual* memory. It does *not* read the whole thing into physical memory;
if it did, there would be no purpose to mmap support in the OS in the
first place, as a thin wrapper around existing file calls would work.

> If the lexical analyzer reads the whole file into memory then we need more 
> pages. The mmap thing might help as we need only read pages (for a lexical scanner).

The read-write status of the pages is not why mmap is an advantage; the
advantage is that the OS naturally and transparently takes care of
loading just the portions you want, and intelligently discarding them when
you are done (more intelligently than you could, even in theory, since it
can take advantage of knowing the entire state of the system, which your
program can't).

In other words, as Skip was trying to tell you, mmap *already
does* what you are saying might be better, and it does it better than you
can, even in theory, from inside a process (as the OS will not reveal to
you the data structures it has that you would need to match that
performance).
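
To make that concrete, here is a minimal sketch of running a regex
straight over an mmap'ed file, with no hand-rolled buffer window. It is
written in Python 3 syntax and is not code from this thread; the name
scan_file and the choice of a read-only mapping are my own assumptions for
illustration. The re module accepts an mmap object wherever it accepts
bytes, as long as the pattern is a bytes pattern.

```python
import mmap
import re


def scan_file(path, pattern):
    """Return (offset, matched_bytes) for every match of a compiled
    bytes pattern in the file at path.

    scan_file is a hypothetical helper, not part of any library.
    """
    with open(path, "rb") as f:
        # Length 0 maps the whole file. Note that mmap raises ValueError
        # for an empty file, so callers should handle that case.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
            # The regex engine walks the mapping like a bytes object;
            # the OS decides which pages are resident at any moment.
            return [(mt.start(), mt.group()) for mt in pattern.finditer(m)]
```

Something like scan_file("input.txt", re.compile(rb"\d+")) would then
return the offset and text of each run of digits. The results are copied
out before the mapping is closed, so nothing refers to the mapping
afterwards.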

As you try to understand mmap, make sure your mental model can take into
account the fact that it is easy and quite common to mmap a file several
times larger than your physical memory, and it does not even *try* to read
the whole thing in at any given time. You may benefit from
reviewing/studying the difference between virtual memory and physical
memory.
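
A small experiment along these lines may help build that mental model.
This is only a sketch under assumptions: read_tail is a name I made up,
the syntax is Python 3, and how pages are actually faulted in is up to
your OS. It maps a whole file but copies out only the last few bytes, so
the rest of the file is never asked for.

```python
import mmap


def read_tail(path, nbytes):
    """Return the last nbytes of a non-empty file through a read-only
    mapping of the whole file.

    read_tail is a hypothetical helper for illustration only.
    """
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
            size = len(m)
            start = max(0, size - nbytes)
            # Slicing copies just this range into a bytes object; only
            # the pages backing it need to be brought in.
            return m[start:size]
```

Try it on a large file and watch your process's resident memory: the
mapping covers the whole file, but only the part you touch should need to
be loaded.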


