regex over files

Tue Apr 26 14:32:29 EDT 2005

Skip Montanaro wrote:
>     Robin> So we avoid dirty page writes etc etc. However, I still think I
>     Robin> could get away with a small window into the file which would be
>     Robin> more efficient.
> 
> It's hard to imagine how sliding a small window onto a file within Python
> would be more efficient than the operating system's paging system. ;-)
> 
> Skip
Well, it might be if I only want to scan forward through the file (think lexical 
analysis). Most lexical analyzers use a buffer and produce a stream of tokens, 
i.e. a compressed version of the input. There are problems with tokens that 
cross buffer boundaries etc., but we never normally need the whole file in memory.
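
To make the buffer idea concrete, here is a minimal sketch of a forward-only 
scanner, assuming a current Python 3 and that a "token" is just a run of 
non-whitespace. The names scan_tokens, TOKEN and chunk_size are made up for 
illustration. It holds one chunk plus whatever partial token was cut off at the 
end of the previous chunk, which is one simple way to deal with the 
buffer-crossing problem:

```python
import io
import re

# Assumption for illustration: a token is any run of non-whitespace.
TOKEN = re.compile(r"\S+")

def scan_tokens(stream, chunk_size=4096):
    """Yield tokens from a text stream, holding one chunk plus a carry-over."""
    tail = ""  # partial token cut off at the end of the previous chunk
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        buf = tail + chunk
        tail = ""
        for m in TOKEN.finditer(buf):
            if m.end() == len(buf):
                # The token touches the end of the buffer, so it may
                # continue in the next chunk: hold it back.
                tail = m.group()
                break
            yield m.group()
    if tail:
        # End of input: the held-back text is a complete token.
        yield tail

tokens = list(scan_tokens(io.StringIO("alpha beta  gamma\ndelta"), chunk_size=4))
```

Note that a very long token is still accumulated in tail here, so this sketch 
does not avoid storing long tokens; it only avoids holding the whole file.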

If the lexical analyzer reads the whole file into memory then we need more 
pages. The mmap approach might help, as a lexical scanner only needs to read 
pages, never write them.
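
As a rough sketch of the mmap approach, assuming a current Python 3 where mmap 
objects can be used as context managers and passed to re with a bytes pattern: 
the file is mapped read-only, so the scan never dirties a page. The helpers 
count_matches and demo, and the sample data, are hypothetical. Mapping an empty 
file raises an error on some platforms, which this sketch does not handle.

```python
import mmap
import os
import re
import tempfile

def count_matches(path, pattern):
    """Count non-overlapping matches of a bytes pattern via a read-only mmap."""
    with open(path, "rb") as f:
        # Length 0 maps the whole file; ACCESS_READ makes the mapping read-only.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            return sum(1 for _ in re.finditer(pattern, mm))

def demo():
    """Write a small throwaway file (made-up sample data) and scan it."""
    fd, path = tempfile.mkstemp()
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(b"foo bar foo baz\n")
        return count_matches(path, rb"foo")
    finally:
        os.remove(path)
```

The operating system decides which pages are actually resident, so the Python 
side never holds a second copy of the file contents.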

Scanners work by detecting the transitions between tokens, so even if the tokens 
are very long we don't need to store them twice (in the input stream and in the 
token accumulator). I suppose that could be true of regex pattern matchers, but 
it doesn't seem to be for re, i.e. we need the entire text matching the pattern 
in the input before we can match and extract to an accumulator.
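
A small illustration of that last point, with made-up data and a hypothetical 
helper match_per_chunk: if re is applied to each buffer independently, a match 
that straddles a buffer boundary is reported as two shorter matches, because re 
only sees what is in the string it is given.

```python
import re

NUMBER = re.compile(r"\d+")

def match_per_chunk(text, chunk_size):
    """Run the pattern on each fixed-size chunk separately (naive approach)."""
    found = []
    for start in range(0, len(text), chunk_size):
        found.extend(NUMBER.findall(text[start:start + chunk_size]))
    return found

text = "abc 12345 def"
whole = NUMBER.findall(text)      # the entire input is available to re
split = match_per_chunk(text, 6)  # chunks are "abc 12", "345 de", "f"
```

Here whole contains the single number 12345, while split contains 12 and 345 
as separate matches.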
-- 
Robin Becker