[Tutor] file question [line # lookup / anydbm]

Tue Aug 5 16:50:02 EDT 2003

On Tue, 5 Aug 2003 jpollack@socrates.Berkeley.EDU wrote:

> Hi everyone,
>
> Is there a way to pull a specific line from a file without reading the
> whole thing into memory with .readlines()?
>
> I have a monstrous text file, but I can easily figure out the index
> numbers of the particular line I need without having to read the whole
> thing in?  Is there a quick way to do this?
>
> I couldn't find anything in documentation specifically addressing this.

Hi Joshua,

A "preprocessing" approach here might work --- we can write a program that
takes your large text file, and transforms it into something that's easier
to search through.

One way we can preprocess the text file is to use the 'anydbm' library:

    http://www.python.org/doc/lib/module-anydbm.html

Anydbm gives us an object that, for most purposes, acts like a dictionary,
except it's stored on disk.  One way we can approach this problem might be
something like this:

###
from __future__ import generators
import anydbm

def createIndex(filename, index_filename):
    """Creates a new index of the given filename, keyed by line number
    (counting from 0)."""
    index = anydbm.open(index_filename, "c")
    for line_number, line in enumerate(open(filename)):
        ## anydbm keys and values must be strings, so convert the number
        index[str(line_number)] = line
    index.close()

def lookupLine(index, line_number):
    return index[str(line_number)]

def enumerate(sequence):
    """Compatibility function for Python < 2.3.  In Python 2.3, this is
    a builtin."""
    i = 0
    for x in sequence:
        yield i, x
        i = i + 1
###

Let's see if this works:

###
>>> createIndex("/usr/share/dict/words", 'index')
>>> index = anydbm.open('index')
>>> lookupLine(index, 0)
'A\n'
>>> lookupLine(index, 500)
'absconder\n'
>>> lookupLine(index, 42)
'abalone\n'
>>> lookupLine(index, 1999)
'actinomere\n'
>>> lookupLine(index, 2999)
'advisably\n'
>>> lookupLine(index, 10999)
'Aonian\n'
>>> lookupLine(index, 70999)
'floriculturally\n'
###

And now arbitrary line lookup is easy.  The preprocessing step ---
creating the index --- only has to be done once; after that, we can do
all our lookups through the dbm file.

The disadvantage of this approach is that the index file stores a second
copy of every line, so it ends up at least as large as the original text
file.  We can modify our approach so that it doesn't waste so much space
--- instead of saving the lines in our anydbm file, we can store the byte
positions where those lines start, and seek() to them in the original file.
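
Here is a rough sketch of that byte-position idea.  It is written for
Python 3, where anydbm has been renamed to dbm, and the names
createOffsetIndex and lookupLineByOffset are made up for this example.
The text file is opened in binary mode so that tell() and seek() deal in
plain byte offsets, which also means lookups return bytes:

```python
import dbm  # Python 3 name for the old anydbm module


def createOffsetIndex(filename, index_filename):
    """Creates an index mapping line number (counting from 0) to the
    byte offset where that line starts in filename."""
    # "n" always starts from a fresh, empty index file.
    with dbm.open(index_filename, "n") as index, open(filename, "rb") as f:
        line_number = 0
        offset = f.tell()          # position before reading the line
        line = f.readline()
        while line:
            # keys and values have to be strings (or bytes)
            index[str(line_number)] = str(offset)
            line_number += 1
            offset = f.tell()
            line = f.readline()


def lookupLineByOffset(filename, index, line_number):
    """Returns the requested line as bytes, reading only that line."""
    offset = int(index[str(line_number)])   # int() accepts the stored bytes
    with open(filename, "rb") as f:
        f.seek(offset)
        return f.readline()
```

The index now holds only a short number per line, and each lookup costs
one seek() plus one readline() on the original file.  The catch is that
the index goes stale whenever the text file changes, so it would have to
be rebuilt then.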

But perhaps this is a bit of overkill?  *grin*  Will this dbm approach work
ok for you?

Good luck!