seek in a file

Jim Dennis jimd at vega.starshine.org
Wed Mar 27 06:40:52 EST 2002


In article <a7qujn$n78gl$1 at ID-69142.news.dfncis.de>, 
	Andreas Penzel wrote:

> Hello NG!

> How to jump to a specified line in a file which is read-opened?
> With seek() I can jump to an exact position of  the complete file in
> "byte-steps".
> Is seek able to jump to a line-number?
> If not, what else can I do?

>Thanks for help!

>- Andreas

 This is going to sound harsh, but have you THOUGHT about 
 what you're asking?  Close your eyes and think about how
 the computer works.  If a particular system *could* seek
 directly by line number how *could* that possibly work?
 What would the system have to *know* in order for that to
 work?

 You clearly know what seek() is (at least to some small
 degree).  seek() is a method of Python file objects; it calls
 the fseek() C library function.  lseek() and llseek() are Unix 
 (and Linux) system calls (which are generally called by 
 the C libraries on these systems).  You've already seen that
 seek() provides an interface to randomly access any byte of a 
 file.  Perhaps you've read the man pages for fseek() and/or 
 lseek() (if you're on a UNIX or UNIX-like system). 

 But how could the OS seek to an arbitrary line number?  
 I can only think of three ways that this could work:

  1) Lines could be of fixed length
  2) Files could be indexed with the offsets of each line terminator
  3) The OS could search through the file searching for line terminators

 We know that the modern systems on which Python runs don't have
 fixed line length constraints.  You might guess the files and 
 filesystem metadata don't contain line terminator indexes (under
 UNIX, Linux, MacOS and the various MS Windows OSes, at least). So
 that pretty much leaves us with option #3.

 It turns out, of course, that the OS leaves option number three to
 "user space." In other words, they don't provide system calls or
 low level (core) functions to search through and count line
 terminators.

 Of course it's true that Python *could* provide such a convenience
 function.  However, Python couldn't portably provide such a function
 that would be any more efficient than a short function that you can
 write for yourself.

 The easiest answer would be simply:

	f = open(yourfile,'r')
	lines = f.readlines()
	your_line = lines[n]
	f.close()

 ... where n was the line number you wanted (and yourfile is a variable
 containing the name of your target file, duh!).  This could actually 
 be shortened to:

 	open(yourfile,'r').readlines()[n]

 (all on one statement line, creating an anonymous file object,
 reading ALL of its lines and returning the nth one; and implicitly
 closing the file as the reference count to the anonymous file
 goes to zero; because we didn't bind it to a variable name).

 That would be the easy and somewhat sloppy answer.  It's probably
 fine for files that you can guarantee are small (that is relatively
 smaller than the memory available to your Python processes).
 Stylistically it's better to have three or four lines of code,
 the open and close, and the readlines() method.  If you're going to
 access more than just one line, then you might as well store the
 whole readlines() list into a variable and use that repeatedly.

 If there's any chance that your file might exceed (or even approach)
 your available memory then you should limit how much each readlines()
 call reads by passing it a size hint.  Of course that means keeping
 track of a few more details:

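	f = open(yourfile,'r')
	curLine = 0		# lines read (and discarded) in earlier chunks
	while 1:
		lines = f.readlines(10000)	# whole lines, roughly 10000 bytes
		# stop at end of file, or once this chunk holds line n
		if not lines or curLine + len(lines) > n:
			break
		curLine += len(lines)
	your_line = lines[ n - curLine ]	# IndexError if the file is too short
	f.close()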

 That should create a loop that will terminate when the "lines" 
 list contains line number n (your target line as in my previous
 examples).  The tricky part is that the target line might end up
 anywhere in the "lines" buffer.  So, at the end of the loop we
 have to find our target line by subtracting the count of all of 
 the lines that we'd previously read (and implicitly discarded) 
 to leave us with the remaining offset into our lines buffer.

 Naturally you could create your own convenience function 
 that did all this for you.
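
 Such a convenience function might look something like the sketch
 below.  The name get_line, the 0-based line numbering, and the
 "hint" parameter are my own choices for illustration, and it is
 written for a current Python 3 interpreter; treat it as one way
 to package the loop above, not as the definitive one.

```python
def get_line(path, n, hint=10000):
    """Return line n (0-based) of the text file at path.

    Reads the file in chunks of whole lines totalling roughly `hint`
    bytes each, so only one chunk is held in memory at a time.
    """
    if n < 0:
        raise IndexError("line numbers start at 0")
    with open(path, 'r') as f:
        seen = 0  # lines consumed in earlier chunks
        while True:
            lines = f.readlines(hint)
            if not lines:
                # End of file reached before line n turned up.
                raise IndexError("file has fewer than %d lines" % (n + 1))
            if seen + len(lines) > n:
                # The target is in this chunk, at offset n - seen.
                return lines[n - seen]
            seen += len(lines)
```

 A call such as get_line(yourfile, 41) would then return the
 forty-second line, newline included.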

 Of course this is predicated on the notion that your file
 contains lines of reasonable length.  More precisely it could 
 fail if your file contains any unreasonably long lines before your
 target line.  In practice I wouldn't worry much about this.
 You could seek to the end of the file (f.seek(0,2)) and check the
 size using f.tell() and then just do a "readlines()" if that's
 less than some reasonable watermark (10 to 60 MB on a late model
 low-end single-user PC).  It is very unlikely that the longest line
 length is going to be a problem.
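
 The size check described above could be sketched as follows.  The
 function name get_line_sized and the 10 MB default watermark are
 illustrative assumptions.  The file is opened in binary mode (so
 the result is a bytes object) because, on a current Python 3,
 tell() is only guaranteed to be a plain byte offset for binary
 files.

```python
def get_line_sized(path, n, watermark=10 * 1024 * 1024):
    """Return line n (0-based) as bytes.

    Slurps the whole file only when it is smaller than `watermark`
    bytes; otherwise scans it one line at a time.
    """
    if n < 0:
        raise IndexError("line numbers start at 0")
    with open(path, 'rb') as f:
        f.seek(0, 2)        # whence=2: offset is relative to end of file
        size = f.tell()     # byte offset of the end, i.e. the file size
        f.seek(0)           # rewind before reading anything
        if size < watermark:
            return f.readlines()[n]   # IndexError if n is too large
        for i, line in enumerate(f):
            if i == n:
                return line
    raise IndexError("file has fewer than %d lines" % (n + 1))
```

 Either branch returns the same line; the watermark only decides
 how much memory is spent getting there.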

 (If it was, then you could use f.read(x) where x was your 
 buffer size, and then parse through that character/byte buffer
 to separate it into lines.  At that point you'd be well advised
 to look at the StringIO module, which would allow you to treat 
 strings as I/O streams and perform readlines() methods on strings
 as they are returned by a regular file object's read(x) method.
 As I say, it's unlikely that you have to go that far; I'm just 
 mentioning these options to be a completist --- to explain how you
 could use Python's features to overcome all adversity, even the
 tragedy of insufficient memory and degenerate text file formatting).
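
 For the completists, here is a rough sketch of that fixed-size
 buffer approach.  Note that it does not use StringIO; it simply
 scans each buffer for newline bytes with find(), which seemed the
 shorter route.  The name get_line_buffered and the default buffer
 size are assumptions of mine, and the function returns bytes
 because it reads the file in binary mode.

```python
def get_line_buffered(path, n, bufsize=65536):
    """Return line n (0-based) as bytes, reading bufsize bytes at a time.

    Memory use is bounded by one buffer plus the target line itself,
    no matter how long the lines before it are.
    """
    if n < 0:
        raise IndexError("line numbers start at 0")
    current = 0   # number of the line the next unread byte belongs to
    pieces = []   # fragments of the target line collected so far
    with open(path, 'rb') as f:
        while True:
            buf = f.read(bufsize)
            if not buf:
                break                      # end of file
            pos = 0
            while pos < len(buf):
                nl = buf.find(b'\n', pos)
                end = len(buf) if nl == -1 else nl + 1
                if current == n:
                    pieces.append(buf[pos:end])
                if nl == -1:
                    break                  # line continues in next buffer
                if current == n:
                    return b''.join(pieces)
                current += 1
                pos = end
    if pieces:
        return b''.join(pieces)            # last line had no newline
    raise IndexError("file has fewer than %d lines" % (n + 1))
```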

 Given your original question and my (harsh) answer you might 
 naturally ask why Python doesn't provide a convenience function to
 search for an arbitrary line by number.  I hinted at part of the 
 answer.  The Python interpreter can't do this more efficiently 
 than you can (in part, at least, because the most efficient way
 to accomplish this will depend on the amount of memory you have
 available, and the nature of the text file that you are searching).

 Another reason is that it is quite rare for people to need access
 to a specific line (by number) in text files.  If a given application
 requires such access then it would normally be accomplished by
 employing some form of indexing or by using fixed record lengths
 (which might still each be contained on a single line, but that is 
 beside the point).
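
 To make those two alternatives concrete, here is a small sketch of
 each.  All three function names are hypothetical, the index is
 simply a Python list of byte offsets kept in memory, and the files
 are opened in binary mode so that offsets are plain byte counts.
 A real application would likely store its index somewhere more
 durable.

```python
def build_line_index(path):
    """Scan the file once and return the byte offset of each line start."""
    offsets = []
    pos = 0
    with open(path, 'rb') as f:
        for line in f:
            offsets.append(pos)
            pos += len(line)
    return offsets

def read_indexed_line(path, offsets, n):
    """Seek straight to line n (0-based) using a prebuilt offset index."""
    with open(path, 'rb') as f:
        f.seek(offsets[n])          # IndexError if n is not in the index
        return f.readline()

def read_fixed_record(path, n, record_len):
    """Return record n (0-based) from a file of fixed-length records."""
    if n < 0:
        raise IndexError("record numbers start at 0")
    with open(path, 'rb') as f:
        f.seek(n * record_len)      # offset follows from arithmetic alone
        record = f.read(record_len)
    if len(record) < record_len:
        raise IndexError("file has no complete record %d" % n)
    return record
```

 In both cases the expensive part (scanning for line terminators,
 or padding every record to the same length) is paid once, up
 front, which is exactly the knowledge that a bare seek() lacks.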

 So it seems pretty obvious why Python (and other programming 
 languages) don't offer the function that you were asking for.



