tail

Barry Scott barry at barrys-emacs.org
Sun May 8 14:31:45 EDT 2022



> On 8 May 2022, at 17:05, Marco Sulla <Marco.Sulla.Python at gmail.com> wrote:
> 
> I think I've _almost_ found a simpler, general way:
> 
> import os
> 
> _lf = "\n"
> _cr = "\r"
> 
> def tail(filepath, n=10, newline=None, encoding=None, chunk_size=100):
>    n_chunk_size = n * chunk_size

Why use tiny chunks? You can read 4KiB as fast as 100 bytes, as it's typically the smallest size the file system will allocate.
I tend to read in multiples of MiB, as it's near instant.

>    pos = os.stat(filepath).st_size

You cannot mix the POSIX API with text mode.
pos is in bytes from the start of the file.
Text mode works in code points, and bytes != code points.
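
A small illustration of the mismatch (the sample string and the temporary file are only assumptions for the example): the size reported by os.stat() counts bytes, while the length of the text read back in text mode counts code points.

```python
import os
import tempfile

# Hypothetical sample text: "é" takes 2 bytes and "€" takes 3 bytes in UTF-8.
sample = "café €5\n"

with tempfile.NamedTemporaryFile("w", encoding="utf-8", newline="", delete=False) as tmp:
    tmp.write(sample)
    path = tmp.name

size_in_bytes = os.stat(path).st_size          # what the POSIX API reports
with open(path, encoding="utf-8", newline="") as f:
    size_in_code_points = len(f.read())        # what text mode works in

os.remove(path)
print(size_in_bytes, size_in_code_points)      # the two numbers differ
```

So a position computed from st_size is not a valid index into the decoded text.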

>    chunk_line_pos = -1
>    lines_not_found = n
> 
>    with open(filepath, newline=newline, encoding=encoding) as f:
>        text = ""
> 
>        hard_mode = False
> 
>        if newline is None:
>            newline = _lf
>        elif newline == "":
>            hard_mode = True
> 
>        if hard_mode:
>            while pos != 0:
>                pos -= n_chunk_size
> 
>                if pos < 0:
>                    pos = 0
> 
>                f.seek(pos)

In text mode you can only seek to a value returned from f.tell(); otherwise the behaviour is undefined.
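
A minimal sketch of the supported pattern, assuming a small UTF-8 sample file created just for the example: record the value of f.tell() before each readline() and only ever pass those recorded values back to f.seek().

```python
import os
import tempfile

# Hypothetical sample file with multi-byte characters.
with tempfile.NamedTemporaryFile("w", encoding="utf-8", delete=False) as tmp:
    tmp.write("αβγ\nδεζ\nηθι\n")
    path = tmp.name

with open(path, encoding="utf-8") as f:
    offsets = []
    while True:
        cookie = f.tell()          # opaque position, only meaningful to f.seek()
        if not f.readline():
            break
        offsets.append(cookie)
    f.seek(offsets[-1])            # safe: this value came from f.tell()
    last_line = f.readline()

os.remove(path)
print(last_line)
```

This walks the whole file, so it is not a fast tail; it only shows which seek targets are valid in text mode.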

>                text = f.read()

You have no limit on the amount of data read.
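
One way to bound the read, sketched under the assumption that the file is opened in binary mode (where seeking to an arbitrary byte offset is well defined). The helper name read_last_bytes and the 1 MiB default are made up for the example.

```python
import os

def read_last_bytes(path, max_bytes=1024 * 1024):
    # Hypothetical helper: return at most max_bytes from the end of the file.
    with open(path, "rb") as f:
        f.seek(0, os.SEEK_END)
        size = f.tell()                      # size in bytes
        f.seek(max(0, size - max_bytes))     # never seek before the start
        return f.read(max_bytes)             # read is capped at max_bytes
```

The result is bytes, so it still has to be decoded, and the first character or line in it may be incomplete.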

>                lf_after = False
> 
>                for i, char in enumerate(reversed(text)):

Simply use text.rindex('\n') or text.rfind('\n') for speed.
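
A rough sketch of what that could look like, assuming the line separator is a plain '\n' and the text is already in memory. The function name last_lines is hypothetical.

```python
def last_lines(text, n):
    # Hypothetical helper: return the last n lines of text using str.rfind.
    if n <= 0:
        return ""
    # Skip a terminating newline so it is not counted as an extra empty line.
    end = len(text) - 1 if text.endswith("\n") else len(text)
    for _ in range(n):
        end = text.rfind("\n", 0, end)   # search backwards, before the previous hit
        if end == -1:
            return text                  # fewer than n lines: all of it is the tail
    return text[end + 1:]
```

For example, last_lines("a\nb\nc\nd\n", 2) gives "c\nd\n". Each rfind call runs in C, which avoids the per-character Python loop.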

>                    if char == _lf:
>                        lf_after = True
>                    elif char == _cr:
>                        lines_not_found -= 1
> 
>                        newline_size = 2 if lf_after else 1
> 
>                        lf_after = False
>                    elif lf_after:
>                        lines_not_found -= 1
>                        newline_size = 1
>                        lf_after = False
> 
> 
>                    if lines_not_found == 0:
>                        chunk_line_pos = len(text) - 1 - i + newline_size
>                        break
> 
>                if lines_not_found == 0:
>                    break
>        else:
>            while pos != 0:
>                pos -= n_chunk_size
> 
>                if pos < 0:
>                    pos = 0
> 
>                f.seek(pos)
>                text = f.read()
> 
>                for i, char in enumerate(reversed(text)):
>                    if char == newline:
>                        lines_not_found -= 1
> 
>                        if lines_not_found == 0:
>                            chunk_line_pos = len(text) - 1 - i + len(newline)
>                            break
> 
>                if lines_not_found == 0:
>                    break
> 
> 
>    if chunk_line_pos == -1:
>        chunk_line_pos = 0
> 
>    return text[chunk_line_pos:]
> 
> 
> In short, the file is always opened in text mode. The file is read from the
> end in progressively larger chunks, until the start of the file is reached
> or all the lines are found.

It will fail if the contents are not ASCII.

> 
> Why? Because in encodings that use more than 1 byte per character, reading
> a chunk of n bytes and then reading the previous chunk can end up splitting
> the bytes of a single character across the two chunks.

No, it cannot. Text mode only knows how to return code points. If you were in
binary mode a character could be split, but you are not in binary mode, so it cannot.
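
To illustrate the binary-mode case only (the byte split below is contrived for the example): decoding half of a multi-byte character on its own fails, while an incremental decoder from the codecs module buffers the incomplete bytes until the rest arrives.

```python
import codecs

encoded = "€".encode("utf-8")            # three bytes: b'\xe2\x82\xac'
first, second = encoded[:1], encoded[1:] # contrived split inside the character

# Decoding a partial chunk on its own raises an error ...
try:
    first.decode("utf-8")
    split_is_error = False
except UnicodeDecodeError:
    split_is_error = True

# ... while an incremental decoder keeps the incomplete bytes for the next call.
decoder = codecs.getincrementaldecoder("utf-8")()
pieces = decoder.decode(first) + decoder.decode(second, final=True)
print(split_is_error, pieces)
```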

> I think one could read chunk by chunk and handle the chunk junction problem.
> I suppose the code would be faster that way. Still, this trick seems to be
> quite fast and it's a lot simpler.

> The final result is read from the chunk, and not from the file, so there are
> no problems of misalignment between bytes and text. Furthermore, the builtin
> encoding parameter is used, so this should work with all the encodings
> (untested).
> 
> Furthermore, a newline parameter can be specified, as in open(). If it's
> equal to the empty string, things are a little more complicated, but I
> suppose the code is clear. That is untested too. I only tested with a UTF-8
> Linux file.
> 
> Do you think there is a chance of getting this function as a method of the
> file object in CPython? The method for a file object opened in bytes mode is
> simpler, since there's no encoding and newline is only \n in that case.

State your requirements. Then see if your implementation meets them.

Barry
