tail

Wed May 11 15:58:13 EDT 2022

On Mon, 9 May 2022 at 23:15, Dennis Lee Bieber <wlfraed at ix.netcom.com>
wrote:
>
> On Mon, 9 May 2022 21:11:23 +0200, Marco Sulla
> <Marco.Sulla.Python at gmail.com> declaimed the following:
>
> >Nevertheless, tail is a fundamental tool in *nix. It's fast and
> >reliable. Also the tail command can't handle different encodings?
>
>         Based upon
> https://github.com/coreutils/coreutils/blob/master/src/tail.c the ONLY
> thing tail looks at is single byte "\n". It does not handle other line
> endings, and appears to perform BINARY I/O, not text I/O. It does nothing
> for bytes that are not "\n". Split multi-byte encodings are irrelevant
> since, if it does not find enough "\n" bytes in the buffer (chunk) it reads
> another binary chunk and seeks for additional "\n" bytes. Once it finds the
> desired amount, it is synchronized on the byte following the "\n" (which,
> for multi-byte encodings might be a NUL, but in any event, should be a safe
> location for subsequent I/O).
>
>         Interpretation of encoding appears to fall to the console driver
> configuration when displaying the bytes output by tail.

Ok, I understand. This should be a Python implementation of *nix tail:

import os

_lf = b"\n"
_err_n = "Parameter n must be a positive integer number"
_err_chunk_size = "Parameter chunk_size must be a positive integer number"

def tail(filepath, n=10, chunk_size=100):
    if n <= 0 or n % 1 != 0:
        raise ValueError(_err_n)

    if chunk_size <= 0 or chunk_size % 1 != 0:
        raise ValueError(_err_chunk_size)

    n_chunk_size = n * chunk_size
    pos = os.stat(filepath).st_size
    lines_not_found = n
    start = 0
    text = bytearray()

    with open(filepath, "rb") as f:
        # Like *nix tail, treat a final "\n" as the terminator of the last
        # line, not as the start of an extra empty line.
        if pos > 0:
            f.seek(pos - 1)

            if f.read(1) == _lf:
                lines_not_found += 1

        while pos > 0 and lines_not_found > 0:
            # Read only the bytes not read yet, so the chunk at the start
            # of the file never overlaps the previous one.
            read_size = min(n_chunk_size, pos)
            pos -= read_size
            f.seek(pos)
            chars = f.read(read_size)
            text[0:0] = chars
            search_pos = len(chars)

            while lines_not_found > 0:
                search_pos = chars.rfind(_lf, 0, search_pos)

                if search_pos == -1:
                    break

                lines_not_found -= 1

            # The newest chunk sits at the start of text, so an index in
            # chars is also an index in text. If no "\n" was left in this
            # chunk, search_pos is -1 and start becomes 0.
            start = search_pos + 1

    return bytes(text[start:])

The function opens the file in binary mode and searches only for b"\n". It
returns the last n lines of the file as bytes.
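
As a way to check the result on small inputs, here is a minimal sketch of a naive reference that works on bytes already in memory. The name naive_tail_bytes is hypothetical, and the sketch assumes that a final b"\n" terminates the last line, as *nix tail does, rather than starting an extra empty one.

```python
def naive_tail_bytes(data, n=10):
    """Return the last n lines of data (bytes), splitting only on b"\\n".

    Hypothetical reference helper: it needs the whole content in memory,
    so it is only meant for checking results on small inputs.
    """
    if n <= 0:
        raise ValueError("n must be a positive integer")

    # Split on b"\n" and give every piece its terminator back.
    lines = [piece + b"\n" for piece in data.split(b"\n")]

    # The last piece had no terminator in the original data.
    lines[-1] = lines[-1][:-1]

    # If the data ended with b"\n" (or was empty), the last piece is empty
    # and is not a line of its own.
    if lines[-1] == b"":
        lines.pop()

    return b"".join(lines[-n:])


print(naive_tail_bytes(b"one\ntwo\nthree\n", n=2))
```

Comparing this against the chunked function on files with and without a final newline, and with sizes that are not a multiple of the chunk size, covers the usual edge cases.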

I suppose this function is fast. It reads the file backwards in chunks of
n * chunk_size bytes and prepends each chunk to a bytearray. The final
result is sliced from the bytearray and converted to bytes (to be
consistent with the read method).

I suppose the function is reliable. The file is opened in binary mode and
only b"\n" is searched for as the line end, as *nix tail (and Python's
readline in binary mode) does. Bytes are returned, so the caller can use
them as they are, decode them to a string with whatever encoding it wants,
or do anything else it likes :)
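
A small illustration of the decoding step left to the caller, using made-up sample text. It also checks an assumption worth stating: searching for the single byte b"\n" is safe for UTF-8, where byte 0x0A never appears inside a multi-byte character, but not for every encoding. In UTF-16, for example, that byte can be one half of another character.

```python
# Bytes as a binary-mode tail would return them; the sample text is made up.
data = "naïve\ncafé\n日本語\n".encode("utf-8")

# Start of the last line: just after the last b"\n" before the final one.
start = data.rfind(b"\n", 0, len(data) - 1) + 1
last_line = data[start:]

# The caller chooses the encoding when (and if) it needs a string.
print(last_line.decode("utf-8"))

# In UTF-8, no multi-byte character contains the byte 0x0A.
assert all(b"\n" not in ch.encode("utf-8") for ch in "ïé日本語")

# In UTF-16-LE, U+010A is encoded as the bytes 0x0A 0x01, so a byte-level
# search for b"\n" would report a line end that is not there.
assert "\u010a".encode("utf-16-le") == b"\n\x01"
```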

Finally, it seems to me the function is quite simple.

If all my claims hold, the three obstacles that Chris listed should be
overcome.

I'd very much like to see a CPython implementation of that function. It
could be a method of a file object opened in binary mode, and *only* in
binary mode.
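
To make the proposal a bit more concrete, here is a rough sketch of how such a method could behave, written as a plain function over an already open file object. The name tail_from_file and its parameters are hypothetical, nothing like it exists in CPython today, and the sketch assumes the file object is seekable. It collects the chunks in a list and joins them at the end instead of prepending to a bytearray.

```python
import io


def tail_from_file(f, n=10, chunk_size=4096):
    """Return the last n lines of a seekable binary file object as bytes.

    Hypothetical helper, not an existing file method.
    """
    if isinstance(f, io.TextIOBase):
        raise TypeError("file must be opened in binary mode")
    if n <= 0 or chunk_size <= 0:
        raise ValueError("n and chunk_size must be positive integers")

    pos = f.seek(0, io.SEEK_END)  # seek() returns the new absolute position
    if pos == 0:
        return b""

    # A final b"\n" ends the last line, so one more b"\n" has to be found.
    f.seek(pos - 1)
    wanted = n + 1 if f.read(1) == b"\n" else n

    chunks = []
    while pos > 0 and wanted > 0:
        size = min(chunk_size, pos)  # never read past what is left
        pos -= size
        f.seek(pos)
        chunk = f.read(size)

        end = len(chunk)
        while wanted > 0:
            end = chunk.rfind(b"\n", 0, end)
            if end == -1:
                break
            wanted -= 1

        # end == -1 keeps the whole chunk; otherwise keep only what
        # follows the last b"\n" that was needed.
        chunks.append(chunk[end + 1:])

    return b"".join(reversed(chunks))


print(tail_from_file(io.BytesIO(b"one\ntwo\nthree\n"), n=2))
```

Refusing text-mode objects up front matches the "only in binary mode" restriction, since seeking to arbitrary byte offsets is not meaningful on a decoded text stream.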

What do you think about it?