Python garbage collector/memory manager behaving strangely

88888 Dihedral dihedral88888 at googlemail.com
Mon Sep 17 00:39:05 EDT 2012


On Monday, 17 September 2012 at 11:25:06 UTC+8, alex23 wrote:
> On Sep 17, 12:32 pm, "Jadhav, Alok" <alok.jad... at credit-suisse.com>
> wrote:
> > - As you have seen, the line separator is not '\n' but '|\n'.
> > Sometimes the data itself has '\n' characters in the middle of the
> > line, and the only way to find the true end of a line is that the
> > previous character should be a bar '|'. I was not able to specify the
> > end of line using the readlines() function, but I could do it using
> > the split() function. (One hack would be to readlines and combine
> > them until I find '|\n'. Is there a cleaner way to do this?)
>
> You can use a generator to take care of your readlines requirements:
>
>     def readlines(f):
>         lines = []
>         while "f is not empty":
>             line = f.readline()
>             if not line: break
>             if line.endswith('|\n'):
>                 lines.append(line)
>                 yield ''.join(lines)
>                 lines = []
>             else:
>                 lines.append(line)
>
> > - Reading the whole file at once and processing it line by line was
> > much faster. Though speed is not a very important issue here, I think
> > the time it took to parse the complete file was reduced to one third
> > of the original time.
>
> With the readlines generator above, it'll read lines from the file
> until it has a complete "line" by your requirement, at which point
> it'll yield it. If you don't need the entire file in memory for the
> end result, you'll be able to process each "line" one at a time and
> perform whatever you need against it before asking for the next.
>
>     with open('infile.txt', 'r') as infile:
>         for line in readlines(infile):
>             ...
>
> Generators are a very efficient way of processing large amounts of
> data. You can chain them together very easily:
>
>     real_lines = readlines(infile)
>     marker_lines = (l for l in real_lines if l.startswith('#'))
>     every_second_marker = (l for i, l in enumerate(marker_lines)
>                            if (i + 1) % 2 == 0)
>     map(some_function, every_second_marker)
>
> The real_lines generator returns your definition of a line. The
> marker_lines generator filters out everything that doesn't start with
> '#', while every_second_marker returns only half of those. (Yes, these
> could all be written as a single generator, but this is very useful
> for more complex pipelines.)
>
> The big advantage of this approach is that nothing is read from the
> file into memory until map is called, and given the way they're
> chained together, only one of your lines should be in memory at any
> given time.
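
The quoted generator can be sanity-checked against a small in-memory sample (io.StringIO stands in for the real file; a trailing flush is added here so an unterminated final record is not silently dropped -- this is a sketch, not alex23's exact code):

```python
import io

def readlines(f):
    # Accumulate physical lines until one ends with the '|\n' terminator,
    # then yield the accumulated text as one logical record.
    lines = []
    while True:
        line = f.readline()
        if not line:
            break
        lines.append(line)
        if line.endswith('|\n'):
            yield ''.join(lines)
            lines = []
    if lines:
        # Flush a final record that lacks the '|\n' terminator.
        yield ''.join(lines)

sample = io.StringIO("a|b|\nc\nd|\ne|f|\n")
records = list(readlines(sample))
# The embedded '\n' after 'c' does not end a record, because it is not
# preceded by '|':
print(records)  # ['a|b|\n', 'c\nd|\n', 'e|f|\n']
```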

The basic question is whether producing the output items really requires
all lines of the input text file to be buffered, or whether each record
can be processed as soon as it is complete.
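
That question can be answered empirically: wiring up the quoted pipeline over a small in-memory sample shows that each record is produced, filtered, and consumed one at a time, with only the current record buffered (the sample data and the final list step here are illustrative):

```python
import io

def records(f):
    # Yield logical records terminated by '|\n', as in the quoted post.
    buf = []
    for line in f:
        buf.append(line)
        if line.endswith('|\n'):
            yield ''.join(buf)
            buf = []

data = io.StringIO("#one|\ntwo|\n#three|\n#four|\n")
real_lines = records(data)
marker_lines = (l for l in real_lines if l.startswith('#'))
every_second_marker = (l for i, l in enumerate(marker_lines)
                       if (i + 1) % 2 == 0)

# Nothing has been read from `data` yet: generator chains are lazy, and
# records flow through one at a time only when the final step pulls them.
result = [l.rstrip('|\n') for l in every_second_marker]
print(result)  # ['#three']
```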
