regex over files

Wed Apr 27 00:34:33 EDT 2005

On Mon, 25 Apr 2005 16:01:45 +0100, Robin Becker <robin at reportlab.com> wrote:

>Is there any way to get regexes to work on non-string/unicode objects? I would 
>like to split large files by regex and it seems relatively hard to do so without 
>having the whole file in memory. Even with buffers it seems hard to get regexes 
>to indicate that they failed because of buffer termination and getting a partial 
>match to be resumable seems out of the question.
>
>What interface does re actually need for its src objects?

ISTM splitting is a special case where you can easily
chunk through a file and split as you go: if splitting
the current buffer succeeds, you can be sure that all but the
tail piece are valid[1]. So you can make an iterator that yields
all but the last piece, then sets the buffer to last+newchunk,
and goes on until there are no more chunks, at which point the tail
will be a valid split piece. E.g., (not tested beyond what you see ;-)

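 >>> def frxsplit(path, rxo, chunksize=8192):
 ...     # Yield the pieces rxo.split() would give, reading chunksize chars at a time.
 ...     buffer = ''
 ...     with open(path) as f:  # make sure the file gets closed
 ...         for chunk in iter(lambda: f.read(chunksize), ''):
 ...             buffer += chunk
 ...             pieces = rxo.split(buffer)
 ...             # All but the last piece are final; the tail may still grow.
 ...             for piece in pieces[:-1]:
 ...                 yield piece
 ...             buffer = pieces[-1]
 ...     yield buffer  # what is left after the last chunk is the final piece
 ...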
 >>> import re
 >>> rxo = re.compile('XXXXX')

The test file:

 >>> print('----\n%s----' % open('tsplit.txt').read())
 ----
 This is going to be split on five X's
 like XXXXX but we will use a buffer of
 XXXXX length 2 to force buffer appending.
 We'll try a splitter at the end: XXXXX
 ----

 >>> for piece in frxsplit('tsplit.txt', rxo, 2): print(repr(piece))
 ...
 "This is going to be split on five X's\nlike "
 ' but we will use a buffer of\n'
 " length 2 to force buffer appending.\nWe'll try a splitter at the end: "
 '\n'

 >>> rxo = re.compile('(XXXXX)')
 >>> for piece in frxsplit('tsplit.txt', rxo, 2): print(repr(piece))
 ...
 "This is going to be split on five X's\nlike "
 'XXXXX'
 ' but we will use a buffer of\n'
 'XXXXX'
 " length 2 to force buffer appending.\nWe'll try a splitter at the end: "
 'XXXXX'
 '\n'
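
One way to gain some confidence that the chunking does not change the
result is to compare it against rxo.split() on the whole text for several
chunk sizes. What follows is only a sketch in Python 3, under the assumption
of a fixed-length delimiter like the one above. The names chunked_split and
check_against_full_split are made up for this illustration, and the text is
read from an in-memory io.StringIO instead of a real file:

```python
import io
import re

def chunked_split(fileobj, rxo, chunksize=8192):
    """Same idea as frxsplit, but reads from an already-open text file object."""
    buffer = ''
    for chunk in iter(lambda: fileobj.read(chunksize), ''):
        buffer += chunk
        pieces = rxo.split(buffer)
        for piece in pieces[:-1]:
            yield piece
        buffer = pieces[-1]  # the tail may still grow with the next chunk
    yield buffer

def check_against_full_split(text, rxo, chunksizes=(1, 2, 3, 7, 64)):
    """Return True if chunked splitting equals rxo.split(text) for every size."""
    expected = rxo.split(text)
    return all(
        list(chunked_split(io.StringIO(text), rxo, size)) == expected
        for size in chunksizes
    )

# Same contents as the tsplit.txt test file shown earlier.
SAMPLE = (
    "This is going to be split on five X's\n"
    "like XXXXX but we will use a buffer of\n"
    "XXXXX length 2 to force buffer appending.\n"
    "We'll try a splitter at the end: XXXXX\n"
)

print(check_against_full_split(SAMPLE, re.compile('XXXXX')))    # True
print(check_against_full_split(SAMPLE, re.compile('(XXXXX)')))  # True
```

The check is not expected to pass for every pattern. A greedy
variable-length delimiter such as X+ can be cut short at a chunk boundary,
so check_against_full_split('aXXXb', re.compile('X+'), (1,)) comes back
False.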

[1] In some cases of regexes with lookahead context, you might
have to check that the last piece not only exists but exceeds the
max lookahead length, in case there is a <withlookahead>|<plain>
kind of thing in the regex where <withlookahead> would have succeeded
with another chunk appended to the buffer, but <plain> did the split.
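
The check described in [1] could be sketched roughly as follows (Python 3).
This is an illustration, not a tested recipe: it swaps split() for
finditer() so that each match can be examined individually, and it assumes
the pattern has no capturing groups, no lookbehind, and never matches the
empty string. The name guarded_split and the max_lookahead parameter are
hypothetical; the caller has to know how many characters of lookahead the
pattern can need.

```python
import io
import re

def guarded_split(fileobj, rxo, max_lookahead, chunksize=8192):
    """Split a text stream on rxo, trusting a match only when at least
    max_lookahead characters follow it in the buffer."""
    buffer = ''
    for chunk in iter(lambda: fileobj.read(chunksize), ''):
        buffer += chunk
        start = 0  # where the current, not yet yielded, piece begins
        for m in rxo.finditer(buffer):
            if m.end() + max_lookahead > len(buffer):
                break  # too close to the end: decide after the next chunk
            yield buffer[start:m.start()]
            start = m.end()
        buffer = buffer[start:]
    # End of input: no further context can arrive, so accept every match.
    start = 0
    for m in rxo.finditer(buffer):
        yield buffer[start:m.start()]
        start = m.end()
    yield buffer[start:]

# "X not followed by Y" needs one character of lookahead.
rxo = re.compile(r'X(?!Y)')
text = 'aXYbXc'
print(list(guarded_split(io.StringIO(text), rxo, max_lookahead=1, chunksize=2)))
# ['aXYb', 'c']
print(rxo.split(text))
# ['aXYb', 'c']
```

With a chunk size of 2 the first buffer is 'aX', where the X looks like a
valid delimiter only because the Y has not been read yet. Holding that match
back until one more character is available avoids a premature split there.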

Regards,
Bengt Richter