efficient 'tail' implementation
Bengt Richter
bokr at oz.net
Fri Dec 9 05:53:30 EST 2005
On Thu, 08 Dec 2005 02:09:58 -0500, Mike Meyer <mwm at mired.org> wrote:
>s99999999s2003 at yahoo.com writes:
>> I have a file which is very large eg over 200Mb , and i am going to use
>> python to code a "tail"
>> command to get the last few lines of the file. What is a good algorithm
>> for this type of task in python for very big files?
>> Initially, i thought of reading everything into an array from the file
>> and just get the last few elements (lines) but since it's a very big
>> file, don't think is efficient.
>
>Well, 200mb isn't all that big these days. But it's easy to code:
>
># untested code
>input = open(filename)
>tail = input.readlines()[-tailcount:]
>input.close()
>
>and you're done. However, it will go through a lot of memory. Fastest
>is probably working through it backwards, but that may take multiple
>tries to get everything you want:
>
># untested code
>input = open(filename)
>blocksize = tailcount * expected_line_length
>tail = []
>while len(tail) < tailcount:
>    input.seek(-blocksize, 2)  # whence=2: offset is relative to end of file
>    tail = input.read().split('\n')
>    blocksize *= 2
>input.close()
>tail = tail[-tailcount:]
>
>It would probably be more efficient to read blocks backwards and paste
>them together, but I'm not going to get into that.
>
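For reference, the quoted doubling-block idea might look roughly like this in present-day Python 3. This is only a sketch under my own assumptions: the name tail_by_doubling and the default expected_line_length of 80 are made up for illustration, the file is read in binary mode (so lines come back as bytes), and the block is clamped to the file size so that short files don't cause a failed seek.

```python
import os

def tail_by_doubling(filename, tailcount=10, expected_line_length=80):
    """Return up to the last `tailcount` lines of a file, as bytes objects."""
    if tailcount <= 0:
        return []
    with open(filename, 'rb') as f:
        f.seek(0, os.SEEK_END)
        filesize = f.tell()
        blocksize = max(1, tailcount * expected_line_length)
        while True:
            # Never try to seek to a position before the start of the file.
            blocksize = min(blocksize, filesize)
            f.seek(filesize - blocksize)
            lines = f.read().splitlines()
            # The first line in the block may be partial, so ask for one
            # line more than needed unless the whole file has been read.
            if len(lines) > tailcount or blocksize == filesize:
                return lines[-tailcount:]
            blocksize *= 2
```

Each retry rereads the tail from scratch, which is the cost the quoted text alludes to: simple to write, but a bad guess for expected_line_length means reading the same bytes several times.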
Ok, I'll have a go (only tested slightly ;-)
 >>> def frsplit(fname, nitems=10, splitter='\n', chunk=8192):
 ...     f = open(fname, 'rb')
 ...     f.seek(0, 2)
 ...     bufpos = f.tell() # pos from file beg == file length
 ...     buf = ['']
 ...     for nl in xrange(nitems):
 ...         while len(buf)<2:
 ...             chunk = min(chunk, bufpos)
 ...             bufpos = bufpos-chunk
 ...             f.seek(bufpos)
 ...             buf = (f.read(chunk)+buf[0]).split(splitter)
 ...             if buf== ['']: break
 ...             if bufpos==0: break
 ...         if len(buf)>1: yield buf.pop(); continue
 ...         if bufpos==0: yield buf.pop(); break
 ...
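
The session above is Python 2 (xrange, str-based I/O). A rough Python 3 adaptation of the same read-chunks-backwards idea might look like the sketch below. The name reverse_split is hypothetical, the splitter is assumed to be a bytes object because the file is opened in binary mode, and the control flow is rearranged slightly, so treat it as an illustration rather than a drop-in replacement.

```python
import os

def reverse_split(fname, nitems=10, splitter=b'\n', chunk=8192):
    """Yield up to nitems splitter-separated items from a file, last item first."""
    with open(fname, 'rb') as f:
        f.seek(0, os.SEEK_END)
        bufpos = f.tell()   # everything before this offset is still unread
        buf = [b'']         # buf[0] is the earliest, possibly partial, item
        yielded = 0
        while yielded < nitems:
            if len(buf) > 1:
                # Everything after buf[0] is known to be a complete item.
                yield buf.pop()
                yielded += 1
            elif bufpos > 0:
                # Read the previous chunk and glue the partial item onto it,
                # so a splitter straddling the chunk boundary is still found.
                step = min(chunk, bufpos)
                bufpos -= step
                f.seek(bufpos)
                buf = (f.read(step) + buf[0]).split(splitter)
            else:
                # Start of file reached: buf[0] is the complete first item.
                yield buf.pop()
                return
```

As with the original, a file that ends with the splitter produces an empty first item, and an empty file produces a single empty item.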
20 lines from the tail of November's python-dev archive:
>>> print '\n'.join(reversed(list(frsplit(r'v:\temp\clp\2005-November.txt', 20))))
lives in the mimelib project's hidden CVS on SF, but that seems pretty
silly.
Basically I'm just going to add the test script, setup.py, generated
html docs and a few additional unit tests, along with svn:external refs
to pull in Lib/email from the appropriate Python svn tree. This way,
I'll be able to create standalone email packages from the sandbox (which
I need to do because I plan on fixing a few outstanding email bugs).
-Barry
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 307 bytes
Desc: This is a digitally signed message part
Url : http://mail.python.org/pipermail/python-dev/attachments/20051130/e88db51d/attachment.pgp
You might want to throw away the first item returned by frsplit unless it is != '' (a
non-empty first item indicates a last line with no trailing \n). Splitting on os.linesep
would be a problematic default: for example, it wouldn't work with the above archive on my
system, since the file has unix line endings and I didn't download it in a manner that
would convert them.
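
The two caveats above could be handled in a small post-processing step. The helper below is a hypothetical sketch (tidy_tail is not part of the code in this post): it assumes the items arrive last-first as bytes split on a newline byte, drops the empty first item that a trailing newline produces, and strips a trailing carriage return so that files with DOS line endings come out the same as files with unix ones.

```python
from itertools import islice

def tidy_tail(items_last_first, nlines):
    """Return up to nlines lines in file order from last-first split items."""
    if nlines <= 0:
        return []
    items = iter(items_last_first)
    first = next(items, None)
    if first is None:
        return []
    # An empty first item only means the file ended with a newline;
    # a non-empty one is a real last line without a trailing newline.
    lines = [] if first == b'' else [first]
    lines.extend(islice(items, nlines - len(lines)))
    # Splitting on b'\n' leaves the b'\r' of a DOS line ending behind.
    lines = [line[:-1] if line.endswith(b'\r') else line for line in lines]
    lines.reverse()
    return lines
```

For example, feeding it the reversed result of splitting b'one\r\ntwo\r\nthree\r\n' on b'\n' and asking for 2 lines gives [b'two', b'three'].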
Regards,
Bengt Richter