seek in a file
Jim Dennis
jimd at vega.starshine.org
Wed Mar 27 06:40:52 EST 2002
In article <a7qujn$n78gl$1 at ID-69142.news.dfncis.de>,
Andreas Penzel wrote:
> Hello NG!
> How to jump to a specified line in a file which is read-opened?
> With seek() I can jump to an exact position of the complete file in
> "byte-steps".
> Is seek able to jump to a line-number?
> If not, what else can I do?
>Thanks for help!
>- Andreas
This is going to sound harsh, but have you THOUGHT about
what you're asking? Close your eyes and think about how
the computer works. If a particular system *could* seek
directly by line number how *could* that possibly work?
What would the system have to *know* in order for that to
work?
You clearly know what seek() is (at least to some small
degree). seek() is a method of Python file objects; it calls
the fseek() C library function. lseek() and llseek() are Unix
(and Linux) system calls (which are generally called by
the C libraries on these systems). You've already seen that
seek() provides an interface to randomly access any byte of a
file. Perhaps you've read the man pages for fseek() and/or
lseek() (if you're on a UNIX or UNIX-like system).
But how could the OS seek to an arbitrary line number?
I can only think of three ways that this could work:
1) Lines could be of fixed length
2) Files could be indexed with the offsets of each line terminator
3) The OS could search through the file searching for line terminators
We know that the modern systems on which Python runs don't have
fixed line length constraints. You might guess the files and
filesystem metadata don't contain line terminator indexes (under
UNIX, Linux, MacOS and the various MS Windows OSes, at least). So
that pretty much leaves us with option #3.
It turns out, of course, that the OS leaves option number three to
"user space." In other words, it doesn't provide system calls or
low level (core) functions to search through and count line
terminators.
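To make option #3 concrete, here is a rough sketch of what "user space" has to do: read the file in blocks, count line terminators, and report the byte offset at which line n starts, which is a value you could then hand to seek(). The name offset_of_line and the bufsize parameter are made up for this illustration, and the sketch assumes lines end in "\n" and are numbered from zero.

```python
def offset_of_line(path, n, bufsize=65536):
    """Return the byte offset at which line n (0-based) starts.

    Returns None if the file holds fewer than n line terminators.
    Hypothetical helper; assumes '\n' line terminators.
    """
    if n == 0:
        return 0                      # line 0 always starts at offset 0
    seen = 0                          # terminators counted so far
    pos = 0                           # file offset of the current block
    with open(path, 'rb') as f:
        while True:
            buf = f.read(bufsize)
            if not buf:
                return None           # ran out of file before line n
            start = 0
            while True:
                i = buf.find(b'\n', start)
                if i == -1:
                    break             # no more terminators in this block
                seen += 1
                if seen == n:
                    return pos + i + 1    # line n begins just past it
                start = i + 1
            pos += len(buf)
```

Note that the whole file up to the target line still gets read; nothing here is cheaper than a scan, which is the point. If the file ends with a terminator, the offset returned for the line "after" the last one is simply the end of the file.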
Of course it's true that Python *could* provide such a convenience
function. However, Python couldn't portably provide such a function
that would be any more efficient than a short function that you can
write for yourself.
The easiest answer would be simply:
f = open(yourfile,'r')
lines = f.readlines()
your_line = lines[n]
f.close()
... where n was the line number you wanted (and yourfile is a variable
containing the name of your target file, duh!). This could actually
be shortened to:
open(yourfile,'r').readlines()[n]
(all on one statement line: it creates an anonymous file object,
reads ALL of its lines and returns the nth one; in CPython the file
is implicitly closed as the reference count of the anonymous file
object goes to zero, because we didn't bind it to a variable name).
That would be the easy and somewhat sloppy answer. It's probably
fine for files that you can guarantee are small (that is relatively
smaller than the memory available to your Python processes).
Stylistically it's better to have three or four lines of code,
the open and close, and the readlines() method. If you're going to
access more than just one line, then you might as well store the
whole readlines() list into a variable and use that repeatedly.
If there's any chance that your file might exceed (or even approach)
your available memory then you should limit how much each
readlines() call reads. Of course that means keeping track of a few
more details:
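f = open(yourfile,'r')
curLine = 0                      # lines read (and discarded) in earlier chunks
while 1:
    lines = f.readlines(10000)   # whole lines, roughly 10000 bytes per chunk
    if not lines: break          # end of file reached before line n
    if curLine + len(lines) > n: break   # this chunk contains line n
    curLine += len(lines)
your_line = lines[ n - curLine ] # IndexError if the file was too short
f.close()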
That should create a loop that will terminate when the "lines"
list contains line number n (your target line as in my previous
examples). The tricky part is that the target line might end up
anywhere in the "lines" buffer. So, at the end of the loop we
have to find our target line by subtracting the count of all of
the lines that we'd previously read (and implicitly discarded)
to leave us with the remaining offset into our lines buffer.
Naturally you could create your own convenience function
that did all this for you.
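Such a convenience function might look something like the sketch below. The name line_from_file and the hint parameter are invented for this example; it counts lines from zero and raises IndexError when the file is too short, which is one reasonable choice among several.

```python
def line_from_file(path, n, hint=10000):
    """Return line n (0-based) of the file at path.

    Reads the file in chunks of whole lines, roughly `hint` bytes at a
    time, so only one chunk is held in memory at once.
    Hypothetical helper, not part of the standard library.
    """
    with open(path, 'r') as f:
        skipped = 0                       # lines in chunks already discarded
        while True:
            lines = f.readlines(hint)
            if not lines:                 # end of file before line n
                raise IndexError("file has fewer than %d lines" % (n + 1))
            if skipped + len(lines) > n:  # target is inside this chunk
                return lines[n - skipped]
            skipped += len(lines)
```

Called as line_from_file(yourfile, n), it hides the bookkeeping and closes the file for you.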
Of course this is predicated on the notion that your file
contains lines of reasonable length. More precisely it could
fail if your file contains any unreasonably long lines before your
target line. In practice I wouldn't worry much about this.
You could seek to the end of the file (f.seek(0,2)) and check the
size using f.tell() and then just do a "readlines()" if that's
less than some reasonable watermark (10 to 60 MB on a late model
low-end single-user PC). It is very unlikely that the longest line
length is going to be a problem.
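As a sketch of that size check: the names file_size, get_line and the watermark default below are assumptions made up for the example, and the 10 MB figure is just the low end of the range mentioned above. The size is measured on a binary-mode handle so that tell() is a plain byte count.

```python
def file_size(path):
    """Return the size of the file in bytes using seek() and tell()."""
    with open(path, 'rb') as f:
        f.seek(0, 2)          # whence=2: relative to the end of the file
        return f.tell()

def get_line(path, n, watermark=10 * 1024 * 1024):
    """Return line n (0-based); slurp small files, stream large ones.

    Hypothetical helper; raises IndexError if the file is too short.
    """
    if file_size(path) <= watermark:
        with open(path, 'r') as f:
            return f.readlines()[n]       # small enough to read whole
    with open(path, 'r') as f:
        for i, line in enumerate(f):      # large file: one line at a time
            if i == n:
                return line
    raise IndexError("file has fewer than %d lines" % (n + 1))
```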
(If it was, then you could use f.read(x) where x was your
buffer size, and then parse through that character/byte buffer
to separate it into lines. At that point you'd be well advised
to look at the StringIO module, which would allow you to treat
strings as I/O streams and perform readlines() methods on strings
as they are returned by a regular file object's read(x) method.
As I say, it's unlikely that you have to go that far; I'm just
mentioning these options to be a completist --- to explain how you
could use Python's features to overcome all adversity, even the
tragedy of insufficient memory and degenerate text file formatting.)
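For the curious, the read(x)-plus-StringIO approach could be sketched roughly as follows. In current Python the class lives in the io module (io.StringIO); the generator name lines_from_blocks and its bufsize parameter are invented for this illustration. The fiddly part is that a block usually ends in the middle of a line, so the unfinished piece has to be carried over to the next block.

```python
import io

def lines_from_blocks(f, bufsize=8192):
    """Yield lines from file object f, reading bufsize characters at a time.

    Hypothetical helper. Each block is wrapped in a StringIO so that
    readlines() can split it; a trailing partial line is held back and
    prepended to the next block.
    """
    tail = ''                               # unfinished line from last block
    while True:
        block = f.read(bufsize)
        if not block:
            break
        pieces = io.StringIO(tail + block).readlines()
        tail = ''
        if pieces and not pieces[-1].endswith('\n'):
            tail = pieces.pop()             # incomplete; wait for more data
        for line in pieces:
            yield line
    if tail:
        yield tail                          # last line had no terminator
```

This bounds the size of each read, but a single enormous line is still assembled in memory before it is yielded. If even that is too much, count terminators in the raw blocks instead of building lines.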
Given your original question and my (harsh) answer you might
naturally ask why Python doesn't provide a convenience function to
search for an arbitrary line by number. I hinted at part of the
answer. The Python interpreter can't do this more efficiently
than you can (in part, at least, because the most efficient way
to accomplish this will depend on the amount of memory you have
available, and the nature of the text file that you are searching).
Another reason is that it is quite rare for people to need access
to a specific line (by number) in text files. If a given application
requires such access then it would normally be accomplished by
employing some form of indexing or by using fixed record lengths
(which might still each be contained on a single line, but that is
beside the point).
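Both of those approaches are easy to sketch. In the example below, build_line_index, read_indexed_line and read_fixed_record are hypothetical names chosen for illustration; the files are opened in binary mode so that offsets are exact byte positions, and record_len is assumed to include the line terminator.

```python
def build_line_index(path):
    """Scan the file once and return a list of byte offsets, one per line."""
    offsets = []
    pos = 0
    with open(path, 'rb') as f:
        for line in f:
            offsets.append(pos)       # this line starts at byte `pos`
            pos += len(line)
    return offsets

def read_indexed_line(path, offsets, n):
    """Seek straight to line n (0-based) using a prebuilt index."""
    with open(path, 'rb') as f:
        f.seek(offsets[n])
        return f.readline()

def read_fixed_record(path, n, record_len):
    """Return record n (0-based) from a file of fixed-length records."""
    with open(path, 'rb') as f:
        f.seek(n * record_len)        # the offset is simple arithmetic
        return f.read(record_len)
```

Building the index costs one full scan, but afterwards any line is a single seek() away, provided the file has not changed since the index was made. With fixed-length records no index is needed at all.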
So it seems pretty obvious why Python (and other programming
languages) don't offer the function that you were asking for.