tail

Sun May 8 16:34:04 EDT 2022

> On 8 May 2022, at 20:48, Marco Sulla <Marco.Sulla.Python at gmail.com> wrote:
> 
> On Sun, 8 May 2022 at 20:31, Barry Scott <barry at barrys-emacs.org> wrote:
>> 
>>>> On 8 May 2022, at 17:05, Marco Sulla <Marco.Sulla.Python at gmail.com> wrote:
>>> 
>>> def tail(filepath, n=10, newline=None, encoding=None, chunk_size=100):
>>>   n_chunk_size = n * chunk_size
>> 
>> Why use tiny chunks? You can read 4KiB as fast as 100 bytes, as it's typically the smallest size the file system will allocate.
>> I tend to read in multiples of MiB, as it's near instant.
> 
> Well, I tested on a little file, a list of my preferred pizzas, so....

Try it on a very big file.

> 
>>>   pos = os.stat(filepath).st_size
>> 
>> You cannot mix POSIX API with text mode.
>> pos is in bytes from the start of the file.
>> Text mode will be in code points. bytes != code points.
>> 
>>>   chunk_line_pos = -1
>>>   lines_not_found = n
>>> 
>>>   with open(filepath, newline=newline, encoding=encoding) as f:
>>>       text = ""
>>> 
>>>       hard_mode = False
>>> 
>>>       if newline is None:
>>>           newline = _lf
>>>       elif newline == "":
>>>           hard_mode = True
>>> 
>>>       if hard_mode:
>>>           while pos != 0:
>>>               pos -= n_chunk_size
>>> 
>>>               if pos < 0:
>>>                   pos = 0
>>> 
>>>               f.seek(pos)
>> 
>> In text mode you can only seek to a value returned from f.tell(); otherwise the behaviour is undefined.
> 
> Why? I don't see any recommendation about it in the docs:
> https://docs.python.org/3/library/io.html#io.IOBase.seek

What does adding 1 to a pos mean?
If it’s binary it means 1 byte further down the file, but in text mode it may need to
move the position anywhere from 1 to 4 bytes down the file (in UTF-8).
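
To make the bytes-versus-code-points point concrete, here is a minimal sketch. It assumes UTF-8 and uses a scratch file whose name is made up for the example. The file is read in binary mode from a byte offset that falls inside a character, and the resulting fragment does not decode.

```python
import os
import tempfile

text = "h\u00e9llo\n"          # 'é' is one code point but two bytes in UTF-8
data = text.encode("utf-8")

# Six code points, seven bytes: offsets in the two views do not line up.
print(len(text), len(data))

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.txt")  # hypothetical scratch file
    with open(path, "wb") as f:
        f.write(data)

    with open(path, "rb") as f:
        f.seek(2)               # byte offset 2 is the second byte of 'é'
        fragment = f.read()

try:
    fragment.decode("utf-8")
    split_character = False
except UnicodeDecodeError:
    # The fragment starts with a continuation byte, so it is not valid UTF-8.
    split_character = True

print(split_character)
```

A byte offset computed from st_size can land in the middle of a character like this, which is why arithmetic on positions only makes sense in binary mode.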

> 
>>>               text = f.read()
>> 
>> You have on limit on the amount of data read.
> 
> I explained that previously. Anyway, chunk_size is small, so it's not
> a great problem.

Typo: I meant you have no limit.

You read all the data up to the end of the file, which might be megabytes of data.
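
As a rough sketch of what a bounded read could look like, assuming the file is opened in binary mode so that positions are plain byte offsets (the helper name read_last_block and its 1 MiB default are made up for this example):

```python
import os


def read_last_block(filepath, chunk_size=1024 * 1024):
    """Return at most chunk_size bytes from the end of the file."""
    with open(filepath, "rb") as f:
        # In binary mode seek() returns the new absolute byte position.
        size = f.seek(0, os.SEEK_END)
        start = max(0, size - chunk_size)
        f.seek(start)
        # Pass an explicit size so the read is always bounded.
        return f.read(size - start)
```

The explicit size passed to read() is what keeps memory use bounded, no matter how large the file is.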
> 
>>>               lf_after = False
>>> 
>>>               for i, char in enumerate(reversed(text)):
>> 
>> Simply use text.rindex('\n') or text.rfind('\n') for speed.
> 
> I can't use them when I have to find either \n or \r. So I preferred to
> simplify the code and use the for cycle every time. Keep in mind
> anyway that this is a prototype for a Python C API implementation
> (builtin I hope, or a C extension if not).
> 
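
When both terminators have to be found, one option is still to let the built-in search do the work instead of a Python-level loop. A minimal sketch, where last_line_break is a hypothetical helper name:

```python
def last_line_break(text):
    """Return the index of the last LF or CR in text, or -1 if there is none."""
    # rfind() returns -1 when the character is absent, so max() picks
    # whichever terminator occurs last, or -1 if neither occurs.
    return max(text.rfind("\n"), text.rfind("\r"))
```

For a "\r\n" pair this returns the index of the "\n", so a caller that treats "\r\n" as a single terminator would still need to check the preceding character.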
>>> Shortly, the file is always opened in text mode. File is read at the end in
>>> bigger and bigger chunks, until the file is finished or all the lines are
>>> found.
>> 
>> It will fail if the contents are not ASCII.
> 
> Why?
> 
>>> Why? Because in encodings that have more than 1 byte per character, reading
>>> a chunk of n bytes, then reading the previous chunk, can eventually split
>>> the character between the chunks in two distinct bytes.
>> 
>> No, it cannot. Text mode only knows how to return code points. Now if you are in
>> binary mode it could be split, but you are not in binary mode so it cannot.
> 
> From the docs:
> 
> seek(offset, whence=SEEK_SET)
> Change the stream position to the given byte offset.
> 
>>> Do you think there are chances to get this function as a method of the file
>>> object in CPython? The method for a file object opened in bytes mode is
>>> simpler, since there's no encoding and newline is only \n in that case.
>> 
>> State your requirements. Then see if your implementation meets them.
> 
> The method should return the last n lines from a file object.
> If the file object is in text mode, the newline parameter must be honored.
> If the file object is in binary mode, a newline is always b"\n", to be
> consistent with readline.
> 
> I suppose the current implementation of tail satisfies the
> requirements for text mode. The previous one satisfied binary mode.
> 
> Anyway, apart from my implementation, I'm curious if you think a tail
> method is worth it to be a method of the builtin file objects in
> CPython.
>
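
For comparison with the requirements stated above, here is a minimal sketch of a tail that does all seeking in binary mode and decodes only at the end. It is not the implementation under discussion. The name tail_sketch, its parameters and the 1 MiB default are made up for this example, and it assumes an ASCII-compatible encoding such as UTF-8, where the byte b"\n" never occurs inside a multi-byte character.

```python
import os


def tail_sketch(filepath, n=10, chunk_size=1024 * 1024, encoding="utf-8"):
    """Return the last n lines of filepath as str, without line endings."""
    if n <= 0:
        return []
    with open(filepath, "rb") as f:
        pos = f.seek(0, os.SEEK_END)   # byte offset of the end of the file
        data = b""
        # Read backwards one block at a time. More than n newlines guarantees
        # n complete lines even when the file ends with b"\n".
        while pos > 0 and data.count(b"\n") <= n:
            start = max(0, pos - chunk_size)
            f.seek(start)
            data = f.read(pos - start) + data
            pos = start
    lines = data.split(b"\n")          # binary mode: only b"\n" ends a line
    if lines and lines[-1] == b"":
        lines.pop()                    # drop the empty piece after a final newline
    # Decode whole lines only, so no character is ever split.
    return [line.decode(encoding) for line in lines[-n:]]
```

Splitting on b"\n" before decoding is what avoids splitting a character, but only under the encoding assumption above. Something like UTF-16 would need different handling, and so would honouring the newline parameter for text mode.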