How to read from a file to an arbitrary delimiter efficiently?

BartC bc at freeuk.com
Sat Feb 27 15:03:42 EST 2016


On 27/02/2016 16:35, BartC wrote:
> On 25/02/2016 06:50, Steven D'Aprano wrote:
>> I have a need to read to an arbitrary delimiter, which might be any of a
>> (small) set of characters. For the sake of the exercise, let's say it is
>> either ! or ? (for example).

> However those aren't the main reasons for the poor speed. The limiting
> factor here is reading one byte at a time. Just a loop like this:
>
>     while f.read(1):
>        pass
>
> without doing anything else, seems to take most of the time. (3.6
> seconds, compared with 5.6 seconds of your readchunks() on a 6MB version
> of your test file, on Python 2.7. readlines() took about 0.2 seconds.)
>
> Any faster solutions would need to read more than one byte at a time.

I've done some more tests using Python 3.4, with the same 200,000-line
6MB test file:

0.25 seconds       Scan the file with 'for line in f'
2.25 seconds       Scan the file with your readlines() routine
4.0  seconds       Scan the file with your readchunks() routine
0.65 seconds       Scan the file using a buffer

This last test uses a 64-byte buffer, reading not more than an extra
63 bytes, but resetting the file position to just past the end of
each identified chunk so that any subsequent read works as expected.

This test (the code is too untidy to post) only checks for two specific
delimiters (not an arbitrary string full of them). (It also counts EOF
as a valid delimiter, so it counts one more chunk.)

Increasing the buffer size doesn't help, and going beyond 256 bytes slows
things down (for this input) as it spends too long rereading data.
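
For anyone who wants to try the idea, here is a minimal sketch of the
buffered approach described above. It is not the code used for the
timings; the name read_to_delimiter and its parameters are made up for
illustration, and it assumes a seekable file opened in binary mode (so
that tell() and seek() work on plain byte offsets).

```python
import io

def read_to_delimiter(f, delimiters=b"!?", bufsize=64):
    """Return bytes up to and including the next delimiter, or up to EOF.

    Hypothetical helper: 'f' is assumed to be a seekable binary file.
    """
    parts = []
    while True:
        start = f.tell()            # remember where this buffer began
        buf = f.read(bufsize)
        if not buf:                 # EOF: return whatever was collected
            break
        # Position of the earliest delimiter in this buffer, if any.
        hits = [i for i in (buf.find(bytes([d])) for d in delimiters)
                if i != -1]
        if hits:
            end = min(hits) + 1     # include the delimiter itself
            parts.append(buf[:end])
            # Reset the file position to just past the chunk, so any
            # subsequent read continues from the right place.
            f.seek(start + end)
            break
        parts.append(buf)           # no delimiter yet, keep reading
    return b"".join(parts)

# Small demonstration on an in-memory file.
demo = io.BytesIO(b"Hello world! How are you? Fine")
chunks = []
while True:
    chunk = read_to_delimiter(demo)
    if not chunk:
        break
    chunks.append(chunk)
print(chunks)
```

Unlike my test, this version returns an empty result at EOF rather than
counting EOF as a delimiter, and it rereads up to bufsize-1 bytes per
chunk, which is why larger buffers stop paying off.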

-- 
Bartc


