regex over files
Robin Becker
robin at reportlab.com
Tue Apr 26 14:32:29 EDT 2005
Skip Montanaro wrote:
> Robin> So we avoid dirty page writes etc etc. However, I still think I
> Robin> could get away with a small window into the file which would be
> Robin> more efficient.
>
> It's hard to imagine how sliding a small window onto a file within Python
> would be more efficient than the operating system's paging system. ;-)
>
> Skip
Well, it might be if I only want to scan forward through the file (think lexical
analysis). Most lexical analyzers use a buffer and produce a stream of tokens, i.e.
a compressed version of the input. There are problems with tokens that cross
buffer boundaries, etc., but we never normally need the whole file in memory.
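
A minimal sketch of that buffered, forward-only approach, assuming a made-up
token grammar (runs of word characters, or single punctuation symbols). The
names scan_tokens, TOKEN and bufsize are illustrative only. A match that
touches the end of the buffer is held back until more input arrives, which is
one way of dealing with tokens that cross buffer boundaries.

```python
import io
import re

# Hypothetical token grammar: a run of word characters, or one symbol.
TOKEN = re.compile(rb"\w+|[^\w\s]")

def scan_tokens(stream, bufsize=4096):
    """Yield tokens from a binary stream, keeping only a small window in memory."""
    buf = b""
    while True:
        chunk = stream.read(bufsize)
        at_eof = not chunk
        buf += chunk
        pos = len(buf)  # by default the whole buffer is consumed
        for m in TOKEN.finditer(buf):
            # A match ending exactly at the buffer's end might continue in the
            # next chunk, so keep it for the next round unless we are at EOF.
            if m.end() == len(buf) and not at_eof:
                pos = m.start()
                break
            yield m.group()
        buf = buf[pos:]
        if at_eof:
            break

tokens = list(scan_tokens(io.BytesIO(b"hello world! foo_bar 42"), bufsize=4))
```

The window only grows beyond bufsize while a single token is longer than the
buffer, so memory use is bounded by the longest token rather than the file size.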
If the lexical analyzer reads the whole file into memory then we need more
pages. The mmap approach might help, as a lexical scanner only ever needs to
read pages, never write them.
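
For comparison, a rough sketch of the mmap route, assuming a read-only mapping
is acceptable. The helper name count_matches is made up for illustration. The
re module accepts an mmap object as the subject as long as the pattern is a
bytes pattern, and the operating system pages the file in as the scan advances.

```python
import mmap
import re

def count_matches(path, pattern):
    """Count non-overlapping matches of a bytes pattern in a file via mmap."""
    with open(path, "rb") as f:
        # Length 0 maps the whole file; ACCESS_READ means no page is ever dirtied.
        # Note: mapping an empty file raises ValueError.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            return sum(1 for _ in re.finditer(pattern, mm))
```

Usage would be something like count_matches("big.log", rb"ERROR"), where the
file name is just a placeholder.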
Scanners work by detecting the transitions between tokens, so even if the tokens
are very long we don't need to store them twice (in the input stream and in the
token accumulator). I suppose that could be true of regex pattern matchers, but it
doesn't seem to be for re, i.e. we need the entire text matching the pattern in
the input before we can match and extract to an accumulator.
--
Robin Becker