How to read from a file to an arbitrary delimiter efficiently?

Steven D'Aprano steve+comp.lang.python at pearwood.info
Thu Feb 25 01:50:51 EST 2016


I need to read up to an arbitrary delimiter, which might be any of a 
(small) set of characters. For the sake of the exercise, let's say it is 
either ! or ? (for example).

I want to read from files reasonably efficiently. I don't mind if there is a 
little overhead, but my first attempt is 100 times slower than the built-in 
"read to the end of the line" method.

Here is the function I came up with:


# Yield chunks of characters from an open text file, one character at a
# time. Each chunk ends with a delimiter, except possibly the last.
def chunkiter(f, delim):
    buffer = []
    b = f.read(1)
    while b:
        buffer.append(b)
        if b in delim:
            yield ''.join(buffer)
            buffer = []
        b = f.read(1)
    if buffer:
        yield ''.join(buffer)



And here is some test code showing how slow it is:


# Create a test file.
FILENAME = '/tmp/foo'
s = """\
abcdefghijklmnopqrstuvwxyz!
abcdefghijklmnopqrstuvwxyz?
""" * 500
with open(FILENAME, 'w') as f:
    f.write(s)


# Run some timing tests, comparing to reading lines from a file.

def readlines(f):
    f.seek(0)
    for line in f:
        pass

def readchunks(f):
    f.seek(0)
    for chunk in chunkiter(f, '!?'):
        pass

from timeit import Timer
SETUP = 'from __main__ import readlines, readchunks, FILENAME; '
SETUP += 'f = open(FILENAME)'  # bind the open file to f for the timed code

t1 = Timer('readlines(f)', SETUP)
t2 = Timer('readchunks(f)', SETUP)

# Time them.
x = t1.repeat(number=10)  # Ignore the first run, in case of caching issues.
x = min(t1.repeat(number=1000, repeat=9))

y = t2.repeat(number=10)
y = min(t2.repeat(number=1000, repeat=9))

print('reading lines:', x, 'reading chunks:', y)

On my laptop, the results I get are:

reading lines: 0.22584209218621254 reading chunks: 21.716224210336804


Is there a better way to read chunks from a file up to one of a set of 
arbitrary delimiters? Bonus for it working equally well with text and bytes.

(You can assume that the delimiters will be no more than one byte, or 
character, each. E.g. "!" or "?", but never "!?" or "?!".)
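
One possible direction, offered only as an untimed sketch: read the file in 
larger blocks and let a compiled regular expression find the delimiters, 
rather than calling f.read(1) for every character. The name 
chunkiter_buffered and the blocksize parameter are hypothetical. The sketch 
assumes f.read(0) returns an empty str or bytes object that reveals the 
file's mode, and that delims is non-empty and has the same type as the 
file's data.

```python
import io
import re

def chunkiter_buffered(f, delims, blocksize=8192):
    # Hypothetical helper: yield chunks ending in any one of the
    # single-character delimiters in delims, reading blocksize units at a time.
    empty = f.read(0)  # '' for text files, b'' for binary files
    if isinstance(empty, bytes):
        pattern = re.compile(b'[' + re.escape(delims) + b']')
    else:
        pattern = re.compile('[' + re.escape(delims) + ']')
    tail = empty  # data after the last delimiter seen so far
    while True:
        block = f.read(blocksize)
        if not block:
            break
        data = tail + block
        start = 0
        for m in pattern.finditer(data):
            # m.end() is one past the delimiter, so the chunk includes it.
            yield data[start:m.end()]
            start = m.end()
        tail = data[start:]
    if tail:
        # Trailing data with no closing delimiter.
        yield tail

# Usage with in-memory files; a small blocksize forces several reads.
text_chunks = list(chunkiter_buffered(io.StringIO('ab!cd?ef'), '!?', blocksize=3))
byte_chunks = list(chunkiter_buffered(io.BytesIO(b'ab!cd?ef'), b'!?', blocksize=3))
```

The sketch rescans the leftover tail on every block, so a file with very 
long stretches between delimiters would need a smarter search start. 
Whether it gets anywhere near the speed of line iteration has not been 
measured here.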

-- 
Steve



