building an index for large text files for fast access

Simon Forman rogue_pedro at yahoo.com
Wed Jul 26 00:18:01 EDT 2006


Yi Xing wrote:
> Hi,
>
> I need to read specific lines of huge text files. Each time, I know
> exactly which line(s) I want to read. readlines() or readline() in a
> loop is just too slow. Since different lines have different size, I
> cannot use seek(). So I am thinking of building an index for the file
> for fast access. Can anybody give me some tips on how to do this in
> Python? Thanks.
>
> Yi

I had to do this for some large log files.  I wrote one simple script
to generate the index file and another that used the index file to read
lines from the log file.  Here are (slightly cleaned up for clarity)
the two scripts.  (Note that they'll only work with files smaller than
4,294,967,296 bytes, assuming 'L' is 4 bytes on your platform.  If your
files are larger than that, substitute 'Q' for 'L' in the struct formats.)
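
To make the size limit concrete, here is a minimal sketch of how the struct
format bounds the largest offset an index entry can hold.  It is written in
Python 3 syntax, and the names FMT_32, FMT_64 and max_offset are made up for
illustration; they are not part of the scripts in this post.  It also uses
the '<' prefix, which is an assumption on my part: it forces a fixed width,
whereas a bare 'L' uses the platform's native size and may be 8 bytes on
some 64-bit systems.

```python
import struct

# Standard-size formats (the "<" prefix) have the same width on every platform.
FMT_32 = '<L'   # unsigned 32-bit integer, little-endian
FMT_64 = '<Q'   # unsigned 64-bit integer, little-endian


def max_offset(fmt):
    """Return the largest byte offset one index entry can hold in this format."""
    return 2 ** (8 * struct.calcsize(fmt)) - 1
```

With a 4-byte entry, max_offset(FMT_32) is 4,294,967,295, so a line starting
at byte 4,294,967,296 or beyond cannot be indexed; struct.pack raises
struct.error for it.  Printing struct.calcsize('L') shows what the native
format gives on your machine.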

First, genoffsets.py:
#!/usr/bin/env python
'''
Write the byte offset of each line.
'''
import fileinput
import struct
import sys

def f(n): return struct.pack('L', n)

def main():
    total = 0
    n = -1  # keeps the final count at 0 if the input is empty

    # Main processing.
    for n, line in enumerate(fileinput.input()):

        sys.stdout.write(f(total))

        total += len(line)

        # Status output.
        if not n % 1000:
            print >> sys.stderr, '%i lines processed' % n

    print >> sys.stderr, '%i lines processed' % (n + 1)


if __name__ == '__main__':
    main()


You use it (on Linux) like so:
cat large_file | ./genoffsets.py > index.dat
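
For anyone on Python 3, the same indexing idea can be sketched as a function
that reads the data file in binary mode, so that len(line) is a true byte
count.  This is only a sketch under a few assumptions of mine: the name
build_index is hypothetical, and it writes fixed 8-byte '<Q' entries instead
of the native 'L' used in genoffsets.py, so its output is not interchangeable
with index.dat produced by that script.

```python
import struct

FMT = '<Q'  # assumption: fixed 8-byte offsets rather than the native 'L'


def build_index(data_path, index_path):
    """Write the byte offset of each line of data_path to index_path.

    Returns the number of lines indexed.
    """
    offset = 0
    count = 0
    # Binary mode, so len(line) counts bytes, including the line ending.
    with open(data_path, 'rb') as data, open(index_path, 'wb') as index:
        for line in data:
            index.write(struct.pack(FMT, offset))
            offset += len(line)
            count += 1
    return count
```

Calling build_index('large_file', 'index.dat') would then play the role of
the shell pipeline above.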

And here's the getline.py script:
#!/usr/bin/env python
'''
Usage: "getline.py <datafile> <indexfile> <num>"

Prints line num (counting from 0) from datafile using indexfile.
'''
import struct
import sys

fmt = 'L'
fmt_size = struct.calcsize(fmt)


def F(n, fn):
    '''
    Return the byte offset of line n from index file fn.
    '''
    f = open(fn, 'rb')  # binary mode: the index holds packed integers

    try:
        f.seek(n * fmt_size)
        data = f.read(fmt_size)
    finally:
        f.close()

    return struct.unpack(fmt, data)[0]


def getline(n, data_file, index_file):
    '''
    Return line n from data file using index file.
    '''
    n = F(n, index_file)
    f = open(data_file)

    try:
        f.seek(n)
        data = f.readline()
    finally:
        f.close()

    return data


if __name__ == '__main__':
    dfn, ifn, lineno = sys.argv[-3:]
    n = int(lineno)
    # The line already ends with a newline, so write it without adding one.
    sys.stdout.write(getline(n, dfn, ifn))
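
If you need several lines per run, reopening both files for every lookup
adds up.  Here is a rough Python 3 sketch of the same lookup that opens each
file once and fetches a list of line numbers.  The name get_lines is
hypothetical, and the '<Q' format is my assumption: it has to match whatever
format the index was written with, so change it to 'L' to read an index made
by genoffsets.py as posted.

```python
import struct

FMT = '<Q'  # assumption: must match the format the index was written with
FMT_SIZE = struct.calcsize(FMT)


def get_lines(data_path, index_path, line_numbers):
    """Return the requested lines (0-based) as bytes, in the order given."""
    result = []
    with open(data_path, 'rb') as data, open(index_path, 'rb') as index:
        for n in line_numbers:
            # Each index entry is FMT_SIZE bytes, so entry n starts here.
            index.seek(n * FMT_SIZE)
            entry = index.read(FMT_SIZE)
            if len(entry) != FMT_SIZE:
                raise IndexError('line %d is not in the index' % n)
            # Jump straight to the start of the line and read just that line.
            data.seek(struct.unpack(FMT, entry)[0])
            result.append(data.readline())
    return result
```

The lines come back as bytes with their line endings intact; decode them if
you need text.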



Hope this helps,
~Simon



