tail

Chris Angelico rosuav at gmail.com
Sat Apr 23 18:19:36 EDT 2022


On Sun, 24 Apr 2022 at 08:06, dn <PythonList at danceswithmice.info> wrote:
>
> On 24/04/2022 09.15, Chris Angelico wrote:
> > On Sun, 24 Apr 2022 at 07:13, Marco Sulla <Marco.Sulla.Python at gmail.com> wrote:
> >>
> >> On Sat, 23 Apr 2022 at 23:00, Chris Angelico <rosuav at gmail.com> wrote:
> >>>>> This is quite inefficient in general.
> >>>>
> >>>> Why inefficient? I think that readlines() will be much slower, not
> >>>> only more time consuming.
> >>>
> >>> It depends on which is more costly: reading the whole file (cost
> >>> depends on size of file) or reading chunks and splitting into lines
> >>> (cost depends on how well you guess at chunk size). If the lines are
> >>> all *precisely* the same number of bytes each, you can pick a chunk
> >>> size and step backwards with near-perfect efficiency (it's still
> >>> likely to be less efficient than reading a file forwards, on most file
> >>> systems, but it'll be close); but if you have to guess, adjust, and
> >>> keep going, then you lose efficiency there.
> >>
> >> Emh, why chunks? My function simply reads byte by byte and compares it to b"\n". When it finds it, it stops and does a readline():
> ...
>
> > Ah. Well, then, THAT is why it's inefficient: you're seeking back one
> > single byte at a time, then reading forwards. That is NOT going to
> > play nicely with file systems or buffers.
> >
> > Compare reading line by line over the file with readlines() and you'll
> > see how abysmal this is.
> >
> > If you really only need one line (which isn't what your original post
> > suggested), I would recommend starting with a chunk that is likely to
> > include a full line, and expanding the chunk until you have that
> > newline. Much more efficient than one byte at a time.
>
>
> Disagreeing with @Chris in the sense that I use tail very frequently,
> and usually in the context of server logs - but I'm talking about the
> Linux implementation, not Python code!

tail(1) doesn't read a single byte at a time. It works in chunks,
more-or-less the way I described. (That's where I borrowed the
technique from.) It finds one block of lines, and displays those; it
doesn't iterate backwards.
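
Roughly like this (an untested sketch of the chunked technique, not
tail's actual source; the function name and the 4 KiB block size are
just my own choices):

import os

def tail_lines(path, n, block_size=4096):
    # Return the last n lines of a file as bytes, reading fixed-size
    # blocks backwards from the end until enough newlines are seen.
    with open(path, "rb") as f:
        f.seek(0, os.SEEK_END)
        pos = f.tell()
        data = b""
        while pos > 0 and data.count(b"\n") <= n:
            read_size = min(block_size, pos)
            pos -= read_size
            f.seek(pos)
            data = f.read(read_size) + data
        return data.splitlines(keepends=True)[-n:]

One seek per block, one split at the end, and the last n entries are
your lines.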

(By the way, the implementation of "tail -f" is actually less
complicated than you might think, as long as inotify is available.
That's been much more useful to me than reading a file backwards.)
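
A bare-bones follow loop is just "seek to the end, keep trying to
read"; the sketch below polls with a sleep, which is where an inotify
wait on a MODIFY event would slot in (via a third-party binding such
as inotify_simple - that part is from memory, so check its docs):

import time

def follow(path, poll_interval=0.5):
    # Crude 'tail -f': yield lines as they are appended to the file.
    with open(path, "r") as f:
        f.seek(0, 2)              # 2 == os.SEEK_END; start at end of file
        while True:
            line = f.readline()
            if line:
                yield line
            else:
                # Nothing new yet. With inotify you'd block on a
                # MODIFY event here instead of sleeping.
                time.sleep(poll_interval)

Use it like: for line in follow("/var/log/syslog"): print(line, end="")
- the path is just an example, of course.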

> Agree with @Chris' assessment of the (in)efficiency. It is more likely
> than not, that you will have a good idea of the length of each line.
> Even if the line-length is highly-variable (thinking of some of my
> applications of the Python logging module!), one can still 'take a stab
> at it' (a "thumb suck" as an engineer-colleague used to say - apparently
> not an electrical engineer!) by remembering that lines exceeding
> 80 characters become less readable and thus have likely/hopefully been
> split into two.
>
> Thus,
>
>     N*(80+p)
>
> where N is the number of lines desired and p is a reasonable
> 'safety'/over-estimation percentage, would give a good chunk size.
> Binar-ily grab that much of the end of the file, split on line-ending,
> and take the last N elements from that list. (with 'recovery code' just
> in case the 'chunk' wasn't large-enough).

Yup, that's the broad idea of the chunked read. If you know how many
lines you're going to need, that's not too bad. If you need to iterate
backwards over the file (as the original question suggested), that
gets complicated fast.
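
As a rough, untested rendering of that estimate-and-retry idea (the
100-bytes-per-line guess and the doubling policy are my own
placeholders, not anything from your post):

import os

def tail_estimate(path, n, est_line_len=100):
    # Guess a chunk size from an assumed average line length, then grow
    # the chunk if the guess turns out too small (the 'recovery code').
    with open(path, "rb") as f:
        f.seek(0, os.SEEK_END)
        size = f.tell()
        chunk = n * est_line_len        # roughly your N * (80 + p)
        while True:
            start = max(0, size - chunk)
            f.seek(start)
            lines = f.read().splitlines()
            # Need strictly more than n lines, since the first one is
            # probably partial -- unless we reached the start of the file.
            if start == 0 or len(lines) > n:
                return lines[-n:]
            chunk *= 2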

> Adding to the efficiency (of the algorithm, but not the dev-time),
> consider that shorter files are likely to be more easily handled by
> reading serially from the beginning. To side-step @Chris' criticism, use
> a generator to produce the individual lines (lazy evaluation and low
> storage requirement) and feed them into a circular-queue which is
> limited to N-entries. QED, as fast as the machine's I/O, and undemanding
> of storage-space!

Still needs to read the whole file though.
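
That said, for short files it really is trivial: the file object is
already a lazy line generator, and collections.deque(maxlen=N) is the
bounded circular queue, so the whole thing is a couple of lines:

from collections import deque

def tail_sequential(path, n):
    # Read front to back, keeping only the last n lines in memory.
    with open(path, "r") as f:
        return list(deque(f, maxlen=n))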

> Running a few timing trials should reveal the 'sweet spot', at which one
> algorithm takes-over from the other!

Unfortunately, the sweet spot will depend on a lot of things other
than just the file size. Is it stored contiguously? Is it on hard
drive or SSD? Can it be read from the disk cache? How frequently do
the chunk size guesses fail, and how often do they grab more than is
necessary?

It's basically unknowable, so the best you can do is judge your own
app, pick one, and go with it.
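
If you do want to measure, something as blunt as timeit over your own
log files will do; the function names here refer to the sketches
above, so adjust to whatever you actually use:

import timeit

path, n = "my_app.log", 10       # substitute one of your real files
for name, call in [("sequential deque", "tail_sequential(path, n)"),
                   ("chunked from end", "tail_lines(path, n)")]:
    secs = timeit.timeit(call, globals=globals(), number=100)
    print(f"{name}: {secs:.3f}s for 100 runs")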

> NB quite a few of IBM's (extensively researched) algorithms which formed
> utility program[me]s on mainframes, made similar such algorithmic
> choices, in the pursuit of efficiencies.

Indeed. Just make a choice and run with it, and accept that there will
be situations where it'll be inefficient.

ChrisA

