efficient 'tail' implementation

Fri Dec 9 05:53:30 EST 2005

On Thu, 08 Dec 2005 02:09:58 -0500, Mike Meyer <mwm at mired.org> wrote:

>s99999999s2003 at yahoo.com writes:
>> I have a very large file (over 200 MB), and I am going to use
>> Python to code a "tail"
>> command to get the last few lines of the file. What is a good algorithm
>> for this type of task in Python for very big files?
>> Initially, I thought of reading everything from the file into an array
>> and just taking the last few elements (lines), but since it's a very big
>> file, I don't think that is efficient.
>
>Well, 200 MB isn't all that big these days. But it's easy to code:
>
># untested code
>input = open(filename)
>tail = input.readlines()[-tailcount:]  # the last tailcount lines, not the first
>input.close()
>
>and you're done. However, it will go through a lot of memory. Fastest
>is probably working through it backwards, but that may take multiple
>tries to get everything you want:
>
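># untested code
>input = open(filename, 'rb')
>input.seek(0, 2)                        # 2 == seek relative to end of file
>filesize = input.tell()
>blocksize = tailcount * expected_line_length
>while True:
>      blocksize = min(blocksize, filesize)  # never seek before the start
>      input.seek(-blocksize, 2)
>      tail = input.read().split('\n')
>      # the first piece may be a partial line and a trailing newline
>      # leaves an empty last piece, so ask for two extra pieces
>      if len(tail) > tailcount + 1 or blocksize == filesize:
>            break
>      blocksize *= 2
>input.close()
>if tail[-1] == '':                      # file ended with a newline
>      tail.pop()
>tail = tail[-tailcount:]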
>
>It would probably be more efficient to read blocks backwards and paste
>them together, but I'm not going to get into that.
>
Ok, I'll have a go (only tested slightly ;-)

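 >>> def frsplit(fname, nitems=10, splitter='\n', chunk=8192):
 ...     # Yield up to nitems pieces of the file, last piece first,
 ...     # reading backwards from the end in blocks of at most chunk bytes.
 ...     f = open(fname, 'rb')
 ...     f.seek(0, 2)      # 2 == seek relative to end of file
 ...     bufpos = f.tell() # pos from file beg == file length
 ...     buf = ['']
 ...     for nl in xrange(nitems):
 ...         # buf[0] may be an incomplete piece, so refill until at
 ...         # least one complete piece follows it
 ...         while len(buf)<2:
 ...             chunk = min(chunk, bufpos)
 ...             bufpos = bufpos-chunk
 ...             f.seek(bufpos)
 ...             buf = (f.read(chunk)+buf[0]).split(splitter)
 ...             if buf== ['']: break
 ...             if bufpos==0: break
 ...         if len(buf)>1: yield buf.pop(); continue
 ...         if bufpos==0: yield buf.pop(); break
 ...     f.close()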
 ...

20 lines from the tail of November's python-dev archive:

 >>> print '\n'.join(reversed(list(frsplit(r'v:\temp\clp\2005-November.txt', 20))))
 lives in the mimelib project's hidden CVS on SF, but that seems pretty
 silly.

 Basically I'm just going to add the test script, setup.py, generated
 html docs and a few additional unit tests, along with svn:external refs
 to pull in Lib/email from the appropriate Python svn tree.  This way,
 I'll be able to create standalone email packages from the sandbox (which
 I need to do because I plan on fixing a few outstanding email bugs).

 -Barry

 -------------- next part --------------
 A non-text attachment was scrubbed...
 Name: not available
 Type: application/pgp-signature
 Size: 307 bytes
 Desc: This is a digitally signed message part
 Url : http://mail.python.org/pipermail/python-dev/attachments/20051130/e88db51d/attachment.pgp

You might want to throw away the first item returned by frsplit when it is '' (a non-empty
first item indicates a last line with no \n). Splitting on os.linesep would be a problematic
default, since e.g. it wouldn't work with the above archive: it has Unix line endings, and I
didn't download it in a manner that would convert them.
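
For anyone trying this on Python 3, here is a minimal sketch of the same read-backwards
idea with the empty last item already discarded and the lines put back in file order. It is
not frsplit itself: the names tail_lines, n and chunk are made up for illustration, it
always splits on b'\n' rather than os.linesep, and it returns bytes because the file is
opened in binary mode.

```python
import os

def tail_lines(path, n=10, chunk=8192):
    """Return the last n lines of the file at path, as bytes, in file order.

    Hypothetical helper for illustration; assumes b'\n' line endings.
    """
    if n <= 0:
        return []
    with open(path, 'rb') as f:
        f.seek(0, os.SEEK_END)
        pos = f.tell()            # position from file start == file length
        data = b''
        # Read blocks backwards until data holds more than n newlines
        # (enough for n complete lines even when the file ends with a
        # newline) or until the start of the file is reached.
        while pos > 0 and data.count(b'\n') <= n:
            step = min(chunk, pos)
            pos -= step
            f.seek(pos)
            data = f.read(step) + data
    pieces = data.split(b'\n')
    if pieces[-1] == b'':
        pieces.pop()              # file ended with a newline: drop the empty piece
    return pieces[-n:]
```

Something like b'\n'.join(tail_lines(fname, 20)).decode() would then give the text to
print, assuming the file's encoding is the default one. A file with DOS line endings would
leave a b'\r' on the end of each line, which the caller would have to strip. Recounting the
newlines after every block is wasteful for a large n, so treat this as a sketch rather than
a tuned implementation.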

Regards,
Bengt Richter