streaming a file object through re.finditer

Christos TZOTZIOY Georgiou tzot at sil-tec.gr
Thu Feb 3 08:43:26 EST 2005


On Wed, 2 Feb 2005 22:22:27 -0500, rumours say that Daniel Bickett
<dbickett at gmail.com> might have written:

>Erick wrote:
>> True, but it doesn't work with multiline regular expressions :(

>If your intent is for the expression to traverse multiple lines (and
>possibly match *across* multiple lines,) then, as far as I know, you
>have no choice but to load the whole file into memory.

*If* the OP knows that their multiline re won't match more than, say, 4 lines at
a time, the code attached at the end of this post could be useful.  Usage:

for group_of_lines in line_groups(<file>, line_count=4):
    # run your multiline re against group_of_lines here

The OP should take care to discard duplicate matches as the n-line window slides
through the input file: consecutive groups overlap by n-1 lines, so the same
text can match in several groups.  E.g. if your re searches for '3\n4', it will
match 3 times in the first example of my code.

|import collections
|
|def line_groups(fileobj, line_count=2):
|    """Yield each run of line_count consecutive lines as one string."""
|    iterator = iter(fileobj)
|    group = collections.deque()
|    joiner = ''.join
|
|    # fill the first window; a file that is too short yields one group only
|    try:
|        while len(group) < line_count:
|            group.append(next(iterator))
|    except StopIteration:
|        yield joiner(group)
|        return
|
|    # the first full window
|    yield joiner(group)
|
|    # slide the window forward one line at a time
|    for line in iterator:
|        group.append(line)
|        group.popleft()
|        yield joiner(group)
|
|if __name__=="__main__":
|    import os, tempfile
|
|    # create two temp files for 4-line groups
|
|    # write n+3 lines in first file
|    testname1= tempfile.mktemp() # deprecated & insecure but ok for this test
|    testfile= open(testname1, "w")
|    testfile.write('\n'.join(map(str, range(7))))
|    testfile.close()
|
|    # write n-2 lines in second file
|    testname2= tempfile.mktemp()
|    testfile= open(testname2, "w")
|    testfile.write('\n'.join(map(str, range(2))))
|    testfile.close()
|
|    # now iterate over four line groups
|
|    with open(testname1) as testfile:
|        for bunch_o_lines in line_groups(testfile, line_count=4):
|            print(repr(bunch_o_lines), end=' ')
|    print()
|
|    with open(testname2) as testfile:
|        for bunch_o_lines in line_groups(testfile, line_count=4):
|            print(repr(bunch_o_lines), end=' ')
|    print()
|
|    os.remove(testname1); os.remove(testname2)
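
As for ignoring the repeated matches, one possible approach (a rough sketch, not
part of the code above) is to remember where each window starts in the input and
report every absolute match span only once.  The names windows_with_offsets and
unique_matches are made up for this illustration, and the sketch assumes the re
gives the same span for the same text no matter which window it is found in
(true for a fixed-width pattern such as '3\n4', not necessarily for greedy
ones).  Offsets are character offsets in the text as read, not byte positions
in the file.

```python
import collections
import io
import re


def windows_with_offsets(fileobj, line_count=2):
    """Yield (start_offset, text) for each sliding window of line_count lines.

    start_offset is the character offset of the window's first line,
    counted from the start of the input (hypothetical helper).
    """
    window = collections.deque()
    start = 0
    for line in fileobj:
        window.append(line)
        if len(window) > line_count:
            # Drop the oldest line and advance the window's start offset.
            start += len(window.popleft())
        if len(window) == line_count:
            yield start, ''.join(window)
    # Input shorter than one full window: yield what there is, once.
    if 0 < len(window) < line_count:
        yield start, ''.join(window)


def unique_matches(pattern, fileobj, line_count=2):
    """Yield (absolute_start, absolute_end, matched_text), each span once."""
    seen = set()
    for offset, text in windows_with_offsets(fileobj, line_count):
        # Spans starting before this window can never be reported again,
        # so forget them to keep memory use bounded.
        seen = {span for span in seen if span[0] >= offset}
        for match in pattern.finditer(text):
            span = (offset + match.start(), offset + match.end())
            if span not in seen:
                seen.add(span)
                yield span[0], span[1], match.group()


if __name__ == "__main__":
    # Same seven lines as the first test file, held in memory.
    demo = io.StringIO('\n'.join(map(str, range(7))))
    for start, end, text in unique_matches(re.compile(r'3\n4'), demo, line_count=4):
        print(start, end, repr(text))
```

With the seven-line input, '3\n4' shows up in three of the four windows but is
printed a single time, as the span 6 to 9.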

-- 
TZOTZIOY, I speak England very best.
"Be strict when sending and tolerant when receiving." (from RFC1958)
I really should keep that in mind when talking with people, actually...


