Read a record instead of a line from a file

Donn Cave donn at u.washington.edu
Fri Aug 24 14:35:27 EDT 2001


Quoth "Andrew Dalke" <dalke at dalkescientific.com>:
| YMK wrote:
|> If I know the "Record Separator" of a flat file, how do I set to read
|> one record at a time ?
|
| Here's something I just tried out using 2.2's 'yield' statement.
| (2.2 is currently in alpha release.)  Warning: this is my first
| generator and I've also not fully tested it.
|
| from __future__ import generators
|
| def SepReader(infile, sep = "\n\n"):
|     text = infile.read(10000)
|     if not text:
|         return
|     while 1:
|         fields = text.split(sep)
|         for field in fields[:-1]:
|             yield field
|         text = fields[-1]
|         new_text = infile.read(10000)
|         if not new_text:
|             yield text
|             break
|         text += new_text
|
| It's used like this
|
| for record in SepReader(open(fortunes), "%\n"):
|     print record

So the generator stuff is just for fun, right?  I mean, this
can just as easily be written as a conventional buffer object.
You lose the for loop usage, but I believe you gain a little
more flexibility in other respects.

import sys

class SepFile:
    def __init__(self, infile, sep = "\n\n"):
        self.fp = infile
        self.sep = sep
        self.text = ''
    def readline(self):
        #  This function should eventually return '' on end of file.
        if self.text is None:
            return ''

        while 1:
            #  To include line ending in result, use find() and slice,
            #  instead of split().
            s = self.text.split(self.sep, 1)
            if len(s) > 1:
                ln, self.text = s
                return ln
            else:
                moretext = self.fp.read(10000)
                if not moretext:
                    #  Notice end of file.  Return the unterminated
                    #  data already here.  If that isn't empty, the
                    #  caller will come back for more, so set self.text
                    #  to short circuit the next read.
                    ln = self.text
                    self.text = None
                    return ln
                self.text = self.text + moretext

sf = SepFile(sys.stdin)
while 1:
    ln = sf.readline()
    if not ln:
        break
    print 'line:', repr(ln)
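
One thing to watch with the split() version above: an empty record
(two separators in a row) comes back as '', which the calling loop
takes for end of file.  The comment in readline() suggests find() and
a slice to keep the separator in the result, and that also avoids the
ambiguity, since a real record is then never empty.  What follows is
only a rough sketch of that idea, written for a current Python 3
rather than the 2.x used above.  The class name SepFileKeepSep and the
bufsize parameter are made up for illustration.

```python
import io


class SepFileKeepSep:
    """Hypothetical variant of SepFile: readline() keeps the separator."""

    def __init__(self, infile, sep="\n\n", bufsize=10000):
        self.fp = infile
        self.sep = sep
        self.bufsize = bufsize  # assumed knob, not in the original class
        self.text = ""
        self.eof = False

    def readline(self):
        while True:
            # Use find() and a slice instead of split(), so the separator
            # stays attached to the record that it terminates.
            i = self.text.find(self.sep)
            if i >= 0:
                end = i + len(self.sep)
                record, self.text = self.text[:end], self.text[end:]
                return record
            if self.eof:
                # No separator left.  Return the unterminated tail once,
                # then '' on every later call.
                record, self.text = self.text, ""
                return record
            moretext = self.fp.read(self.bufsize)
            if moretext:
                self.text += moretext
            else:
                self.eof = True


# The empty record in the middle is returned as "%\n", not as ''.
sf = SepFileKeepSep(io.StringIO("a%\n%\nb"), "%\n")
records = list(iter(sf.readline, ""))
print(records)
```

The two-argument form of iter() calls sf.readline until it returns '',
so the buffer object can be driven from a for loop after all.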

| If you want something that's really high speed, but uses the
| mxTextTools C extension, you can try my Martel parser, which
| is part of biopython.org.  The specific record readers are in
| http://cvs.biopython.org/cgi-bin/viewcvs/viewcvs.cgi/biopython/Martel/RecordReader.py?cvsroot=biopython

mxTextTools rules.

	Donn Cave, donn at u.washington.edu


