regex over files
Bengt Richter
bokr at oz.net
Wed Apr 27 00:34:33 EDT 2005
On Mon, 25 Apr 2005 16:01:45 +0100, Robin Becker <robin at reportlab.com> wrote:
>Is there any way to get regexes to work on non-string/unicode objects. I would
>like to split large files by regex and it seems relatively hard to do so without
>having the whole file in memory. Even with buffers it seems hard to get regexes
>to indicate that they failed because of buffer termination and getting a partial
>match to be resumable seems out of the question.
>
>What interface does re actually need for its src objects?
ISTM splitting is a special situation where you can easily
chunk through a file and split as you go, since if splitting
the current chunk succeeds, you can be sure that all but the
tail piece are valid[1]. So you can make an iterator that yields
all but the last piece, then sets the buffer to last+newchunk,
and goes on until there are no more chunks, at which point the
remaining buffer is the final valid split piece. E.g., (not tested beyond what you see ;-)
>>> def frxsplit(path, rxo, chunksize=8192):
...     buffer = ''
...     for chunk in iter((lambda f=open(path): f.read(chunksize)),''):
...         buffer += chunk
...         pieces = rxo.split(buffer)
...         for piece in pieces[:-1]: yield piece
...         buffer = pieces[-1]
...     yield buffer
...
>>> import re
>>> rxo = re.compile('XXXXX')
The test file:
>>> print '----\n%s----'%open('tsplit.txt').read()
----
This is going to be split on five X's
like XXXXX but we will use a buffer of
XXXXX length 2 to force buffer appending.
We'll try a splitter at the end: XXXXX
----
>>> for piece in frxsplit('tsplit.txt', rxo, 2): print repr(piece)
...
"This is going to be split on five X's\nlike "
' but we will use a buffer of\n'
" length 2 to force buffer appending.\nWe'll try a splitter at the end: "
'\n'
>>> rxo = re.compile('(XXXXX)')
>>> for piece in frxsplit('tsplit.txt', rxo, 2): print repr(piece)
...
"This is going to be split on five X's\nlike "
'XXXXX'
' but we will use a buffer of\n'
'XXXXX'
" length 2 to force buffer appending.\nWe'll try a splitter at the end: "
'XXXXX'
'\n'
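For anyone who wants this outside an interactive session, here is a minimal sketch of the same chunk-and-carry idea as a standalone function. It is written for Python 3 (unlike the Python 2 session above), the name split_file is made up for illustration, and it closes the file via a with block, which the one-liner above does not bother to do. The footnote's caveat applies to it unchanged.

```python
import re

def split_file(path, pattern, chunksize=8192):
    """Yield the pieces of the file at `path`, split on compiled regex `pattern`.

    Hypothetical helper: same approach as frxsplit above, restated for Python 3.
    """
    buffer = ''
    with open(path) as f:
        # iter(callable, sentinel) keeps calling f.read until it returns ''.
        for chunk in iter(lambda: f.read(chunksize), ''):
            buffer += chunk
            pieces = pattern.split(buffer)
            # Every piece except the last is followed by a delimiter, so it is done.
            yield from pieces[:-1]
            # The last piece may still grow or contain a partial delimiter: carry it over.
            buffer = pieces[-1]
    # No more chunks: whatever is left is the final piece.
    yield buffer
```

With the tsplit.txt file shown above, list(split_file('tsplit.txt', re.compile('XXXXX'), 2)) should give the same four pieces as the first frxsplit run.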
[1] In some cases of regexes with lookahead context, you might
have to check that the last piece not only exists but exceeds
the max lookahead length, in case there is a <withlookahead>|<plain>
kind of thing in the regex where <withlookahead> would have succeeded
with another chunk appended to the buffer, but <plain> did the split.
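One way to guard against that case is sketched below. It is a variation on the check described in [1], not the same thing. A match is only accepted once the buffer extends at least `window` characters past the match's start, where `window` is an upper bound the caller supplies for match length plus lookahead. The names split_file_windowed and window are made up for illustration. The sketch assumes Python 3, a pattern that never matches the empty string, no capturing groups, and no lookbehind or anchors. The same guard also covers variable-length delimiters such as X+, which could otherwise be cut short at a chunk boundary, as long as no run is longer than `window`.

```python
import re

def split_file_windowed(path, pattern, window, chunksize=8192):
    """Split a file on `pattern`, accepting a match only when it cannot change.

    Hypothetical helper. `window` is an assumed upper bound on how many
    characters, counted from a match's start, the pattern can examine
    (longest possible match plus longest lookahead).
    """
    buffer = ''
    eof = False
    with open(path) as f:
        while not eof:
            chunk = f.read(chunksize)
            eof = (chunk == '')
            buffer += chunk
            start = 0  # end of the last accepted delimiter within buffer
            for m in pattern.finditer(buffer):
                # Before EOF, a match too close to the end of the buffer might
                # come out differently once more text arrives, so wait.
                if not eof and m.start() + window > len(buffer):
                    break
                yield buffer[start:m.start()]
                start = m.end()
            # Keep only the text after the last accepted delimiter.
            buffer = buffer[start:]
    # At EOF every match was accepted, so the remainder is the final piece.
    yield buffer
```

For a pattern like re.compile('abcdefgh(?=i)|a'), a window of 9 (eight matched characters plus one of lookahead) would be the bound to pass. The price is that held-back text is rescanned on the next round.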
Regards,
Bengt Richter