tail

Chris Angelico rosuav at gmail.com
Sun Apr 24 11:58:37 EDT 2022


On Mon, 25 Apr 2022 at 01:47, Marco Sulla <Marco.Sulla.Python at gmail.com> wrote:
>
>
>
> On Sat, 23 Apr 2022 at 23:18, Chris Angelico <rosuav at gmail.com> wrote:
>>
>> Ah. Well, then, THAT is why it's inefficient: you're seeking back one
>> single byte at a time, then reading forwards. That is NOT going to
>> play nicely with file systems or buffers.
>>
>> Compare reading line by line over the file with readlines() and you'll
>> see how abysmal this is.
>>
>> If you really only need one line (which isn't what your original post
>> suggested), I would recommend starting with a chunk that is likely to
>> include a full line, and expanding the chunk until you have that
>> newline. Much more efficient than one byte at a time.
>
>
> Well, I would like to have a sort of tail, so to generalise to more than 1 line. But I think that once you have a good algorithm for one line, you can repeat it N times.
>

Not always. If you know you want 5 lines, reading them all in one go is
much more efficient than reading 1 line and going back to the file,
five times over. Disk reads are the costliest part, with the possible
exception of memory usage (but usually only because it can cause
additional disk *writes*).

> I understand that you can read a chunk instead of a single byte, so when the newline is found you can return all the cached chunks concatenated. But will this make the search for the start of the line faster? I suppose you always have to read byte by byte (or more, if you're using utf16 etc) and see if there's a newline.
>

Massively massively faster. Try it. Especially, try it on an
artificially slow file system, so you can see what it costs.
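
As a rough sketch of what I mean (not a polished implementation): read
fixed-size chunks from the end of the file and stop once enough newlines
have turned up. The name tail_bytes and the 4096-byte default are my own
choices for illustration, and it works on raw bytes, so it assumes an
encoding where the byte 0x0A only ever means a newline (ASCII, UTF-8,
Latin-1 and friends).

```python
import os

def tail_bytes(path, n=5, chunk_size=4096):
    """Return the last n lines of a file as a list of bytes objects.

    Hypothetical helper for illustration; assumes b"\\n" is unambiguous
    in the file's encoding (true for ASCII-compatible encodings).
    """
    if n <= 0:
        return []
    with open(path, "rb") as f:
        f.seek(0, os.SEEK_END)
        pos = f.tell()
        data = b""
        # Walk backwards one chunk at a time. Stop once there are more
        # than n newlines (so the first line we keep is complete), or
        # once we reach the start of the file.
        while pos > 0 and data.count(b"\n") <= n:
            step = min(chunk_size, pos)
            pos -= step
            f.seek(pos)
            data = f.read(step) + data
    # splitlines() drops the line endings and ignores a trailing newline.
    return data.splitlines()[-n:]
```

For a typical log file this does one seek and one read for the whole
tail, where the byte-at-a-time version does one seek and one read per
byte. Decode the returned lines afterwards if you need str.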

But you can't rely on any backwards reads unless you know for sure
that the encoding supports this. UTF-8 does (you have to scan
backwards for a start byte), UTF-16 does (work with pairs of bytes and
check for surrogates), and fixed-width encodings do, but otherwise,
you won't necessarily know when you've found a valid start point. So
any reverse-read algorithm is going to be restricted to specific
encodings.
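
To illustrate the UTF-8 case, here is a minimal sketch of "scan
backwards for a start byte". The function name utf8_start is made up
for this example, and it assumes the buffer holds valid UTF-8, where
continuation bytes always have the bit pattern 10xxxxxx.

```python
def utf8_start(buf, pos):
    """Return the index of the first byte of the character covering pos.

    Hypothetical helper; assumes buf is valid UTF-8 and 0 <= pos < len(buf).
    """
    # Continuation bytes look like 10xxxxxx; anything else starts a character.
    while pos > 0 and (buf[pos] & 0xC0) == 0x80:
        pos -= 1
    return pos

data = "a\u00e9\u20ac".encode("utf-8")   # 1-byte, 2-byte and 3-byte characters
start = utf8_start(data, len(data) - 1)  # land in the middle of the last one
text = data[start:].decode("utf-8")      # safe to decode from here
```

After seeking to an arbitrary offset and reading a chunk, you would
resync like this before decoding, so the chunk never starts in the
middle of a character.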

ChrisA

