tail

Chris Angelico rosuav at gmail.com
Sat Apr 23 18:11:39 EDT 2022


On Sun, 24 Apr 2022 at 08:03, Peter J. Holzer <hjp-python at hjp.at> wrote:
>
> On 2022-04-24 04:57:20 +1000, Chris Angelico wrote:
> > On Sun, 24 Apr 2022 at 04:37, Marco Sulla <Marco.Sulla.Python at gmail.com> wrote:
> > > What about introducing a method for text streams that reads the lines
> > > from the bottom? Java also has a ReversedLinesFileReader in Apache
> > > Commons IO.
> >
> > It's fundamentally difficult to get precise. In general, there are
> > three steps to reading the last N lines of a file:
> >
> > 1) Find out the size of the file (as of right now, if it's still growing)
> > 2) Seek to the end of the file, minus some threshold that you hope
> > will contain a number of lines
> > 3) Read from there to the end of the file, split it into lines, and
> > keep the last N
> [...]
> > This is quite inefficient in general. It would be far FAR easier to do
> > this instead:
> >
> > 1) Read the entire file and decode bytes to text
> > 2) Split into lines
> > 3) Iterate backwards over the lines
>
> Which one is more efficient depends very much on the size of the file.
> For a file of a few kilobytes, the second solution is probably more
> efficient. But for a few gigabytes, that's almost certainly not the
> case.

Yeah. I said "easier", not necessarily more efficient. Which is more
efficient is a virtually unanswerable question (will you need to
iterate over the whole file or stop part way? Is the file stored
contiguously? Can you memory map it in some way?), so it's going to
depend a lot on your use-case.
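
For reference, the "easier" version quoted above is only a few lines.
This is a minimal sketch, not a stdlib API: the name lines_backwards
and the UTF-8 default are assumptions made for illustration.

```python
def lines_backwards(path, encoding="utf-8"):
    """Yield the lines of a text file from last to first.

    The whole file is read and decoded up front, so memory use is
    proportional to the size of the file.
    """
    with open(path, encoding=encoding) as f:
        lines = f.readlines()  # steps 1 and 2: read, decode, split
    for line in reversed(lines):  # step 3: iterate backwards
        yield line.rstrip("\n")
```

Stopping early saves the iteration, but not the read: the entire file
has already been loaded by the time the first line comes out.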

> > Tada! Done. And in Python, quite easy. The downside, of course, is
> > that you have to store the entire file in memory.
>
> Not just memory. You have to read the whole file in the first place,
> which is hardly efficient if you only need a tiny fraction.

Right - if that's the case, then the chunked form, even though it's
harder, would be worth doing.
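
A rough sketch of the chunked form, following the three steps quoted
earlier. The assumptions are mine: the function name tail, the 4096-byte
starting chunk, "\n" line endings, and an encoding such as UTF-8 in
which the byte 0x0A never occurs inside another character. It works on
bytes because arbitrary seeking isn't reliable on a text-mode stream.

```python
import os

def tail(path, n=10, chunk_size=4096, encoding="utf-8"):
    """Return the last n lines of a file as a list of strings.

    Reads a block from the end of the file and doubles the block size
    until it holds more than n newlines or covers the whole file.
    """
    if n <= 0:
        return []
    with open(path, "rb") as f:
        size = f.seek(0, os.SEEK_END)  # step 1: size as of right now
        want = chunk_size
        while True:
            start = max(0, size - want)
            f.seek(start)  # step 2: back off from the end
            data = f.read(size - start)  # step 3: read up to the old end
            # More than n newlines guarantees n complete lines even if
            # the block begins in the middle of a line.
            if start == 0 or data.count(b"\n") > n:
                break
            want *= 2  # the guess was too small, try a bigger block
    lines = data.split(b"\n")
    if lines and lines[-1] == b"":
        lines.pop()  # the file ended with a newline
    return [line.decode(encoding) for line in lines[-n:]]
```

The retry loop is where the "hope" in step 2 gets paid for: a file with
very long lines can force several re-reads before the block is big
enough.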

> > Personally, unless the file is tremendously large and I know for sure
> > that I'm not going to end up iterating over it all, I would pay the
> > memory price.
>
> Me, too. Problem with a library function (as Marco proposes) is that you
> don't know how it will be used.
>

Yup. And there may be other options worth considering, like
maintaining an index (a bunch of "line 142857 is at byte position
3141592" entries) which would allow random access... but at some
point, if your file is that big, you probably shouldn't be storing it
as a file of lines of text. Use a database instead.
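
The index idea, sketched very roughly. build_line_index and read_line
are hypothetical names, line numbers here are 0-based, and the index is
only valid as long as nothing before the indexed lines is modified.

```python
def build_line_index(path):
    """Return a list where entry i is the byte offset of line i."""
    offsets = []
    pos = 0
    with open(path, "rb") as f:
        for line in f:  # binary mode splits on b"\n" only
            offsets.append(pos)
            pos += len(line)
    return offsets

def read_line(path, offsets, lineno, encoding="utf-8"):
    """Seek straight to line `lineno` using a prebuilt index."""
    with open(path, "rb") as f:
        f.seek(offsets[lineno])
        return f.readline().decode(encoding).rstrip("\n")
```

With the index in hand, reading backwards is just a matter of walking
range(len(offsets) - 1, -1, -1), but building it still costs one full
pass over the file.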

Reading a text file backwards by lines is inherently hard: lines vary
in length, so you can't compute where one starts without scanning for
it. Every file format I know of that involves starting at the end of
the file is defined in binary, so you can actually seek, and is usually
defined with fixed-size structures (so you just go "read the last 768
bytes of the file" or something).
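
To illustrate why fixed-size structures make this easy, here is a
sketch against a made-up format. The 16-byte trailer layout, the b"DEMO"
magic and the name read_trailer are all hypothetical; no real file
format is being described.

```python
import os
import struct

# Hypothetical trailer: 4-byte magic, uint32 record count, uint64
# offset of an index. Little-endian with no padding, 16 bytes total.
TRAILER = struct.Struct("<4sIQ")

def read_trailer(path):
    """Read the fixed-size trailer from the end of a binary file."""
    with open(path, "rb") as f:
        f.seek(-TRAILER.size, os.SEEK_END)  # exact position, no guessing
        magic, count, index_offset = TRAILER.unpack(f.read(TRAILER.size))
    if magic != b"DEMO":
        raise ValueError("trailer magic does not match")
    return count, index_offset
```

One seek, one read, no scanning for delimiters, which is exactly what
a file of variable-length text lines can't offer.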

ChrisA

