file tell in a for-loop

Tim Chase python.list at tim.thechases.com
Wed Nov 19 05:57:46 EST 2008


Magdoll wrote:
> I was trying to map various locations in a file to a dictionary. At
> first I read through the file using a for-loop, but tell() gave back
> weird results, so I switched to while, then it worked.
> 
> The for-loop version was something like:
>                 d = {}
>                 for line in f:
>                          if line.startswith('>'): d[line] = f.tell()
> 
> And the while version was:
>                 d = {}
>                 while 1:
>                         line = f.readline()
>                         if len(line) == 0: break
>                         if line.startswith('>'): d[line] = f.tell()
> 
> 
> In the for-loop version, f.tell() would sometimes return the same
> result multiple times consecutively, even though the for-loop
> apparently progressed the file descriptor. I don't have a clue why
> this happened, but I switched to while loop and then it worked.
> 
> Does anyone have any ideas as to why this is so?

I suspect that the iterator version uses internal read-ahead 
buffering, so tell() reports how far the file has been read into 
that buffer, not the position of the line the loop just handed 
you.  That would explain the repeated values: several consecutive 
lines can be served from one buffered chunk.  I've also had 
problems with tell() returning bogus results while reading 
through large non-binary files (in this case a text file of about 
530 MB) once the file offset passed some point I wasn't able to 
identify.  It may have to do with newline translation, as this 
was Python 2.4 on Win32.  Switching to "b"inary mode resolved the 
issue for me.
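
To illustrate the read-ahead effect, here is a rough Python 3 
sketch that imitates a buffered line iterator by hand.  It is not 
how Python's own file iterator is implemented; the helper name 
read_ahead_lines, the chunk size, and the sample data are all 
made up for the illustration.  It only shows that once a whole 
chunk has been pulled into a buffer, tell() on the underlying file 
reports the end of that chunk for every line served from it.

```python
import os
import tempfile


def read_ahead_lines(raw, chunk_size=8192):
    """Hypothetical stand-in for a buffered line iterator.

    Yields (line, raw.tell()) pairs, where raw.tell() is the position
    of the underlying file after the read-ahead, not of the line.
    """
    pending = b""
    while True:
        chunk = raw.read(chunk_size)
        if not chunk:
            break
        pending += chunk
        lines = pending.split(b"\n")
        pending = lines.pop()  # keep a trailing partial line for later
        for line in lines:
            yield line + b"\n", raw.tell()
    if pending:
        yield pending, raw.tell()


# Build a small sample file (contents invented for the demo).
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b">one\nAAAA\n>two\nCCCC\n")
    path = tmp.name

# buffering=0 gives an unbuffered raw file, so tell() is the real offset.
with open(path, "rb", buffering=0) as raw:
    positions = [pos for line, pos in read_ahead_lines(raw)]

os.remove(path)
print(positions)  # the same offset repeated for every line
```

The 20-byte sample fits in one chunk, so all four lines report 
offset 20, which is the kind of repeated value described above.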

I created the following generator to make my life a little easier:

   def offset_iter(fp):
     assert 'b' in fp.mode.lower(), \
       "offset_iter must have a binary file"
     while True:
       addr = fp.tell()
       line = fp.readline()
       if not line: break
       yield (addr, line.rstrip('\n\r'))

That way, I can just use

   d = {}
   f = file('foo.txt', 'rb')
   for offset, line in offset_iter(f):
     if line.startswith('>'): d[line] = offset

This bookmarks the *beginning* of each line that starts with ">" 
(I think your code notes the *end*, since it calls tell() after 
the line has already been read).
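
For anyone on Python 3, a self-contained variant might look like 
the sketch below.  It assumes a file opened in binary mode, so 
lines are bytes and the prefix test uses b'>'; the temporary file 
and its contents are invented for the demo.  Seeking back to a 
stored offset and re-reading the line is a quick check that the 
bookmark really points at the start of the line.

```python
import os
import tempfile


def offset_iter(fp):
    """Yield (offset, line) pairs; offset is where the line starts."""
    assert "b" in fp.mode, "offset_iter must have a binary file"
    while True:
        addr = fp.tell()       # position before the line is read
        line = fp.readline()
        if not line:
            break
        yield addr, line.rstrip(b"\r\n")


# Sample data (hypothetical), written to a temporary file.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b">one\nAAAA\n>two\nCCCC\n")
    path = tmp.name

d = {}
with open(path, "rb") as f:
    for offset, line in offset_iter(f):
        if line.startswith(b">"):
            d[line] = offset
    # Seeking to a bookmark should land on the start of that line.
    f.seek(d[b">two"])
    reread = f.readline()

os.remove(path)
print(d)
print(reread)
```

Because readline() is used instead of the file iterator, tell() is 
always called at a line boundary.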

-tkc







