Python garbage collector/memory manager behaving strangely

alex23 wuwei23 at gmail.com
Sun Sep 16 23:25:06 EDT 2012


On Sep 17, 12:32 pm, "Jadhav, Alok" <alok.jad... at credit-suisse.com>
wrote:
> - As you have seen, the line separator is not '\n' but '|\n'.
> Sometimes the data itself has '\n' characters in the middle of the line,
> and the only way to find the true end of the line is that the previous
> character should be a bar '|'. I was not able to specify the end of line
> using the readlines() function, but I could do it using the split()
> function. (One hack would be to readlines and combine them until I find
> '|\n'. Is there a cleaner way to do this?)

You can use a generator to take care of your readlines requirements:

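    def readlines(f):
        lines = []
        for line in f:
            lines.append(line)
            # A record ends only where a physical line ends with '|\n'
            # (this also covers a line that is just '|\n').
            if line.endswith('|\n'):
                yield ''.join(lines)
                lines = []
        # Don't silently drop trailing data that lacks the final '|\n'.
        if lines:
            yield ''.join(lines)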

> - Reading the whole file at once and processing line by line was much
> faster. Though speed is not a very important issue here, I think the
> time it took to parse the complete file was reduced to one third of the
> original time.

With the readlines generator above, it'll read lines from the file
until it has a complete "line" by your requirement, at which point
it'll yield it. If you don't need the entire file in memory for the
end result, you'll be able to process each "line" one at a time and
perform whatever you need against it before asking for the next.

    with open(u'infile.txt','r') as infile:
        for line in readlines(infile):
            ...
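
On the speed point: if going line by line turns out to be too slow, a
middle ground is to read the file in large chunks and split each chunk
on '|\n' yourself. This is only a sketch under assumptions: iter_records,
sep and chunk_size are names made up for illustration, and whether it is
actually faster on your data would need measuring.

```python
import io

def iter_records(f, sep='|\n', chunk_size=64 * 1024):
    # Hypothetical helper: yield records ending in sep, reading f in chunks.
    buf = ''
    while True:
        chunk = f.read(chunk_size)
        if not chunk:
            break
        buf += chunk
        parts = buf.split(sep)
        # The last piece may be an incomplete record (or a separator cut
        # in half by the chunk boundary), so keep it for the next round.
        buf = parts.pop()
        for part in parts:
            yield part + sep
    if buf:
        # Trailing data without a final separator.
        yield buf

# Tiny chunk size on purpose, so a separator straddles a chunk boundary.
sample = io.StringIO('a|b\nstill a|\nc|d|\n')
records = list(iter_records(sample, chunk_size=4))
```

Memory use stays bounded by the chunk size plus the longest record, so
you keep most of the benefit of the generator approach.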

Generators are a very efficient way of processing large amounts of
data. You can chain them together very easily:

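    real_lines = readlines(infile)
    marker_lines = (l for l in real_lines if l.startswith('#'))
    every_second_marker = (l for i, l in enumerate(marker_lines)
                           if (i + 1) % 2 == 0)
    # Python 2: map() drives the pipeline. In Python 3 map() is lazy,
    # so loop over its result (or use a plain for loop) to consume it.
    map(some_function, every_second_marker)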

The real_lines generator yields your definition of a line. The
marker_lines generator filters out everything that doesn't start with
#, while every_second_marker keeps only every second one of those.
(Yes, these could all be written as a single generator, but splitting
them up is very useful for more complex pipelines.)
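
For the "every second item" step, itertools.islice is one alternative
to the enumerate test. A minimal sketch, assuming the records arrive as
any iterable of strings; every_second_marker as a function and the
prefix parameter are illustrative names, not anything from your code:

```python
from itertools import islice

def every_second_marker(records, prefix='#'):
    # Keep only records that start with the marker prefix.
    markers = (r for r in records if r.startswith(prefix))
    # islice(it, 1, None, 2) yields the items at 0-based positions
    # 1, 3, 5, ..., i.e. the 2nd, 4th, 6th marker, and stays lazy.
    return islice(markers, 1, None, 2)

sample = ['#a|\n', 'x|\n', '#b|\n', '#c|\n', 'y|\n', '#d|\n']
picked = list(every_second_marker(sample))
```

This selects the same items as the (i + 1) % 2 == 0 condition.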

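The big advantage of this approach is that nothing is read from the
file into memory until the pipeline is consumed, which here is the map
call (in Python 3, map is itself lazy, so it is iterating over its
result that does the work). Given the way the generators are chained
together, only one of your lines should be in memory at any given
time.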
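
You can see the laziness for yourself with a small experiment. This is
a sketch only: records is a stand-in for the '|\n' reader, the events
list is purely instrumentation, and io.StringIO plays the part of the
file.

```python
import io

events = []  # one entry per physical line actually pulled from the file

def records(f):
    # Stand-in record reader that logs every physical line it reads.
    buf = []
    for line in f:
        events.append('read')
        buf.append(line)
        if line.endswith('|\n'):
            yield ''.join(buf)
            buf = []

infile = io.StringIO('#a|\nb\nb|\n#c|\n#d|\n')
real_lines = records(infile)
marker_lines = (l for l in real_lines if l.startswith('#'))
every_second = (l for i, l in enumerate(marker_lines) if (i + 1) % 2 == 0)

# Building the pipeline has not touched the file yet.
built_without_reading = (events == [])

# Asking for one result pulls only as many lines as it needs.
first = next(every_second)
reads_for_first = len(events)
```

Here the first result is the second marker record, and producing it
reads four of the five physical lines; the last one is untouched until
you ask for more.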
