file reading by record separator (not line by line)

Thu May 31 09:14:11 EDT 2007

Lee Sander wrote:

> I wanted to also say that this file is really huge, so I cannot
> just do a read() and then split on ">" to get a record
> thanks
> lee

Below is the easy solution. To get even better performance, or if '>' is not
always at the start of a line, you would have to implement the buffering
that readline() does yourself (see _fileobject in socket.py in the
standard lib for an example).

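def chunkreader(f):
    """Yield (name, lines) for each record that starts with a '>' line."""
    name = None
    lines = []
    while True:
        line = f.readline()
        if not line: break          # empty string means end of file
        if line[0] == '>':
            # a new header: emit the record collected so far, if any
            if name is not None:
                yield name, lines
            name = line[1:].strip()  # drop the '>' and surrounding whitespace
            lines = []
        else:
            lines.append(line)
    # emit the last record, which has no following header to trigger it
    if name is not None:
        yield name, lines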

if __name__ == '__main__':
    from StringIO import StringIO
    s = \
"""> name1
line1
line2
line3
> name2
line 4
line 5
line 6"""
    f = StringIO(s)
    for name, lines in chunkreader(f):
        print '***', name
        print ''.join(lines)

$ python test.py
*** name1
line1
line2
line3

*** name2
line 4
line 5
line 6
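
For the case where '>' is not always at the start of a line, the do-it-yourself
buffering mentioned above could look roughly like the sketch below. It is only
an illustration, not what socket.py does: the name read_records and the
blocksize parameter are made up for this example, it is written in Python 3
syntax, and it assumes a file opened in text mode. It reads fixed-size blocks,
splits on the separator, and keeps the unfinished tail for the next round, so
only about one block plus one record is held in memory at a time.

```python
import io

def read_records(f, sep=">", blocksize=64 * 1024):
    """Yield the text between occurrences of sep, reading f in blocks.

    Hypothetical helper for illustration; empty records are skipped.
    """
    buf = ""
    while True:
        block = f.read(blocksize)
        if not block:
            break                   # end of file
        buf += block
        parts = buf.split(sep)
        # All parts except the last are complete records. The last part may
        # continue in the next block (or end in a partial separator), so it
        # stays in the buffer.
        buf = parts.pop()
        for part in parts:
            if part:
                yield part
    if buf:
        yield buf                   # whatever follows the final separator
```

Each yielded record still begins with the header text, so something like
name, _, body = record.partition("\n") separates the name from the rest.
Very large records make the repeated split slow, since the growing buffer is
rescanned for every block; tracking the search position would avoid that.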

-- 

Regards,
Tijs