tail

dn PythonList at DancesWithMice.info
Sat Apr 23 18:04:35 EDT 2022


On 24/04/2022 09.15, Chris Angelico wrote:
> On Sun, 24 Apr 2022 at 07:13, Marco Sulla <Marco.Sulla.Python at gmail.com> wrote:
>>
>> On Sat, 23 Apr 2022 at 23:00, Chris Angelico <rosuav at gmail.com> wrote:
>>>>> This is quite inefficient in general.
>>>>
>>>> Why inefficient? I think that readlines() will be much slower, not
>>>> only more time consuming.
>>>
>>> It depends on which is more costly: reading the whole file (cost
>>> depends on size of file) or reading chunks and splitting into lines
>>> (cost depends on how well you guess at chunk size). If the lines are
>>> all *precisely* the same number of bytes each, you can pick a chunk
>>> size and step backwards with near-perfect efficiency (it's still
>>> likely to be less efficient than reading a file forwards, on most file
>>> systems, but it'll be close); but if you have to guess, adjust, and
>>> keep going, then you lose efficiency there.
>>
>> Emh, why chunks? My function simply reads byte by byte and compares it to b"\n". When it finds it, it stops and does a readline():
...

> Ah. Well, then, THAT is why it's inefficient: you're seeking back one
> single byte at a time, then reading forwards. That is NOT going to
> play nicely with file systems or buffers.
> 
> Compare reading line by line over the file with readlines() and you'll
> see how abysmal this is.
> 
> If you really only need one line (which isn't what your original post
> suggested), I would recommend starting with a chunk that is likely to
> include a full line, and expanding the chunk until you have that
> newline. Much more efficient than one byte at a time.
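
For concreteness, the pattern being criticised looks roughly like
this - an illustrative reconstruction only, since Marco's actual
function was elided above:

    import os

    def tail_one_line(f):
        # Seek backwards one byte at a time until b"\n" is found,
        # then read forward: one seek+read round-trip per byte,
        # which is what defeats file-system and buffer read-ahead.
        f.seek(0, os.SEEK_END)
        pos = f.tell() - 1       # start before any trailing newline
        while pos > 0:
            f.seek(pos - 1)
            if f.read(1) == b"\n":
                break            # the last line starts at pos
            pos -= 1
        f.seek(max(pos, 0))
        return f.readline()

    # usage: with open("some.log", "rb") as f: print(tail_one_line(f))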


Disagreeing with @Chris in the sense that I use tail very frequently,
and usually in the context of server logs - but I'm talking about the
Linux implementation, not Python code!

Agree with @Chris' assessment of the (in)efficiency. More likely than
not, you will have a good idea of the length of each line. Even if the
line-length is highly variable (thinking of some of my applications of
the Python logging module!), one can still 'take a stab at it' (a
"thumb suck", as an engineer-colleague used to say - apparently not an
electrical engineer!) by remembering that lines exceeding 80 characters
become less readable, and thus have likely/hopefully been split into
two.

Thus,

    N * (80 + p)

where N is the number of lines desired and p is a reasonable safety
margin of extra bytes per line (an over-estimation allowance), would
give a good chunk size. Grab that many bytes from the end of the file
in binary mode, split on the line-ending, and take the last N elements
from that list - with 'recovery code' just in case the chunk wasn't
large enough. A sketch follows below.
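
Something along these lines - a minimal sketch, where the name
tail_chunked and the default values are mine, and the text is assumed
to be UTF-8:

    import os

    def tail_chunked(path, n, avg_len=80, pad=20):
        # Over-estimated chunk size: the N * (80 + p) figure above.
        chunk = n * (avg_len + pad)
        with open(path, "rb") as f:
            f.seek(0, os.SEEK_END)
            size = f.tell()
            while True:
                f.seek(max(size - chunk, 0))
                lines = f.read().splitlines()
                # > n (not >= n): the first element may be a partial
                # line if the chunk started mid-line.
                if len(lines) > n or chunk >= size:
                    return [line.decode() for line in lines[-n:]]
                chunk *= 2    # 'recovery code': chunk was too small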

Adding to the efficiency (of the algorithm, but not the dev-time),
consider that shorter files are likely to be more easily handled by
reading serially from the beginning. To side-step @Chris' criticism,
use a generator to produce the individual lines (lazy evaluation, low
storage requirement) and feed them into a circular queue limited to N
entries - see the sketch below. QED: as fast as the machine's I/O, and
undemanding of storage-space!
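
In Python, collections.deque with maxlen is exactly such a bounded
circular queue, and the file object itself is the lazy line-generator
(tail_serial is my name for it, not a library function):

    from collections import deque

    def tail_serial(path, n):
        # The file object yields lines lazily; deque(maxlen=n) keeps
        # only the most recent n, silently discarding from the front.
        with open(path) as f:
            return list(deque(f, maxlen=n))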

Running a few timing trials should reveal the 'sweet spot' at which one
algorithm takes over from the other!
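
A minimal harness for that (file names hypothetical; tail_serial and
tail_chunked are the sketches above):

    import timeit

    for path in ("short.log", "long.log"):
        for func in (tail_serial, tail_chunked):
            secs = timeit.timeit(lambda: func(path, 10), number=100)
            print(f"{path:12} {func.__name__:14} {secs:.4f}s")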

NB quite a few of IBM's (extensively-researched) algorithms, which
formed utility program[me]s on mainframes, made similar algorithmic
choices in the pursuit of efficiency.
-- 
Regards,
=dn

