tail

Marco Sulla Marco.Sulla.Python at gmail.com
Sun May 8 16:48:32 EDT 2022


On Sun, 8 May 2022 at 22:34, Barry <barry at barrys-emacs.org> wrote:
>
> > On 8 May 2022, at 20:48, Marco Sulla <Marco.Sulla.Python at gmail.com> wrote:
> >
> > On Sun, 8 May 2022 at 20:31, Barry Scott <barry at barrys-emacs.org> wrote:
> >>
> >>>> On 8 May 2022, at 17:05, Marco Sulla <Marco.Sulla.Python at gmail.com> wrote:
> >>>
> >>> def tail(filepath, n=10, newline=None, encoding=None, chunk_size=100):
> >>>   n_chunk_size = n * chunk_size
> >>
> >> Why use tiny chunks? You can read 4KiB as fast as 100 bytes, as it's typically the smallest size the file system will allocate.
> >> I tend to read in multiples of MiB as it's near instant.
> >
> > Well, I tested on a little file, a list of my preferred pizzas, so....
>
> Try it on a very big file.

I'm not saying it's a good idea; it's just the value I needed for my tests.
Anyway, it's not a problem with big files. The problem is with files
with long lines.

> >> In text mode you can only seek to a value return from f.tell() otherwise the behaviour is undefined.
> >
> > Why? I don't see any recommendation about it in the docs:
> > https://docs.python.org/3/library/io.html#io.IOBase.seek
>
> What does adding 1 to a pos mean?
> If it’s binary it mean 1 byte further down the file but in text mode it may need to
> move the point 1, 2 or 3 bytes down the file.

Ehm, I'll quote it again:

seek(offset, whence=SEEK_SET)
Change the stream position to the given byte offset.

And so on. No mention of differences between text and binary mode.
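As an aside, a minimal sketch of the difference Barry is describing, using only an in-memory buffer rather than a real file: in binary mode any byte offset is a legal seek target, but it can land in the middle of a multi-byte character, and the bytes read from that point no longer decode.

```python
import io

# A tiny UTF-8 "file": the two-byte character "à" occupies bytes 7-8.
buf = io.BytesIO("pizza màrgherita\n".encode("utf-8"))

buf.seek(8)                 # legal in binary mode, but mid-character
tail_bytes = buf.read()     # starts with a UTF-8 continuation byte
try:
    tail_bytes.decode("utf-8")
except UnicodeDecodeError as exc:
    print("cannot decode from byte 8:", exc.reason)
```

In text mode the same offset is simply not a meaningful position, which is why only values returned by tell() (or zero) are safe to seek to.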

> >> You have on limit on the amount of data read.
> >
> > I explained that previously. Anyway, chunk_size is small, so it's not
> > a great problem.
>
> Typo: I meant you have no limit.
>
> You read all the data till the end of the file, which might be megabytes of data.

Yes, I already explained why and how it could be optimized. I quote myself:

In short: the file is always opened in text mode. The file is read from the
end in bigger and bigger chunks, until the file is exhausted or all the
requested lines are found.

Why? Because in encodings that use more than one byte per character,
reading a chunk of n bytes and then reading the previous chunk can
split a character across the two chunks, leaving its bytes divided
between two distinct reads.
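The split-character problem in miniature, with a two-byte UTF-8 character sitting exactly on a chunk boundary:

```python
# "à" is two bytes in UTF-8, so a fixed-size chunk boundary can fall
# between its two bytes when reading the file backwards.
data = "àa".encode("utf-8")     # b'\xc3\xa0a'

last_chunk = data[-2:]          # b'\xa0a': starts with a continuation byte
try:
    last_chunk.decode("utf-8")
except UnicodeDecodeError:
    print("the chunk boundary split the character")

# Prepending the previous chunk makes the bytes decodable again.
print(data[-3:].decode("utf-8"))
```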

I think one could read chunk by chunk and handle the chunk-junction
problem explicitly; I suppose the code would be faster that way. Still,
this trick seems quite fast as it is, and it's a lot simpler.
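For illustration, here is a minimal sketch of the growing-chunk idea. It is not the code discussed in the thread: it sidesteps the text-mode seek question entirely by reading in binary mode and decoding only once enough newlines have been collected.

```python
def tail(filepath, n=10, encoding="utf-8", chunk_size=4096):
    """Return the last n lines of a file.

    Reads from the end in bigger and bigger chunks (binary mode, so
    arbitrary seeks are well defined) until n+1 newlines are seen or
    the start of the file is reached, then decodes the result.
    """
    with open(filepath, "rb") as f:
        f.seek(0, 2)                     # seek to end of file
        pos = f.tell()
        data = b""
        size = chunk_size
        while pos > 0 and data.count(b"\n") <= n:
            size = min(size, pos)        # don't read past the start
            pos -= size
            f.seek(pos)
            data = f.read(size) + data   # prepend the new chunk
            size *= 2                    # bigger and bigger chunks
        return [line.decode(encoding) for line in data.splitlines()[-n:]]
```

Decoding only at the very end means a chunk boundary that splits a multi-byte character is harmless: by the time bytes are decoded, all the chunks have been joined back together.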

