Python garbage collector/memory manager behaving strangely

88888 Dihedral dihedral88888 at googlemail.com
Mon Sep 17 00:39:05 EDT 2012


On Monday, 17 September 2012 at 11:25:06 UTC+8, alex23 wrote:
> On Sep 17, 12:32 pm, "Jadhav, Alok" <alok.jad... at credit-suisse.com>
> wrote:
> > - As you have seen, the line separator is not '\n' but '|\n'.
> > Sometimes the data itself has '\n' characters in the middle of the
> > line, and the only way to find the true end of a line is that the
> > previous character should be a bar '|'. I was not able to specify the
> > end of line using the readlines() function, but I could do it using
> > the split() function. (One hack would be to readlines and combine
> > them until I find '|\n'. Is there a cleaner way to do this?)
>
> You can use a generator to take care of your readlines requirements:
>
>     def readlines(f):
>         lines = []
>         while "f is not empty":
>             line = f.readline()
>             if not line: break
>             if line.endswith('|\n'):
>                 lines.append(line)
>                 yield ''.join(lines)
>                 lines = []
>             else:
>                 lines.append(line)
>
> > - Reading the whole file at once and processing it line by line was
> > much faster. Though speed is not a very important issue here, I think
> > the time it took to parse the complete file was reduced to one third
> > of the original time.
>
> With the readlines generator above, it'll read lines from the file
> until it has a complete "line" by your requirement, at which point
> it'll yield it. If you don't need the entire file in memory for the
> end result, you'll be able to process each "line" one at a time and
> perform whatever you need against it before asking for the next.
>
>     with open('infile.txt', 'r') as infile:
>         for line in readlines(infile):
>             ...
>
> Generators are a very efficient way of processing large amounts of
> data. You can chain them together very easily:
>
>     real_lines = readlines(infile)
>     marker_lines = (l for l in real_lines if l.startswith('#'))
>     every_second_marker = (l for i, l in enumerate(marker_lines)
>                            if (i + 1) % 2 == 0)
>     map(some_function, every_second_marker)
>
> The real_lines generator returns your definition of a line. The
> marker_lines generator filters out everything that doesn't start with
> '#', while every_second_marker returns only half of those. (Yes, these
> could all be written as a single generator, but this is very useful
> for more complex pipelines.)
>
> The big advantage of this approach is that nothing is read from the
> file into memory until map is called, and given the way they're
> chained together, only one of your lines should be in memory at any
> given time.
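
The quoted generator can be sanity-checked against a small in-memory sample (io.StringIO stands in for the real file; a trailing flush is added here so an unterminated final record is not silently dropped -- this is a sketch, not alex23's exact code):

```python
import io

def readlines(f):
    # Accumulate physical lines until one ends with the '|\n' terminator,
    # then yield the accumulated text as one logical record.
    lines = []
    while True:
        line = f.readline()
        if not line:
            break
        lines.append(line)
        if line.endswith('|\n'):
            yield ''.join(lines)
            lines = []
    if lines:
        # Flush a final record that lacks the '|\n' terminator.
        yield ''.join(lines)

sample = io.StringIO("a|b|\nc\nd|\ne|f|\n")
records = list(readlines(sample))
# The embedded '\n' after 'c' does not end a record, because it is not
# preceded by '|':
print(records)  # ['a|b|\n', 'c\nd|\n', 'e|f|\n']
```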

The basic question is whether producing the output items really requires
all lines of the input text file to be buffered, or whether each record
can be processed as soon as it is complete.
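
That question can be answered empirically: wiring up the quoted pipeline over a small in-memory sample shows that each record is produced, filtered, and consumed one at a time, with only the current record buffered (the sample data and the final list step here are illustrative):

```python
import io

def records(f):
    # Yield logical records terminated by '|\n', as in the quoted post.
    buf = []
    for line in f:
        buf.append(line)
        if line.endswith('|\n'):
            yield ''.join(buf)
            buf = []

data = io.StringIO("#one|\ntwo|\n#three|\n#four|\n")
real_lines = records(data)
marker_lines = (l for l in real_lines if l.startswith('#'))
every_second_marker = (l for i, l in enumerate(marker_lines)
                       if (i + 1) % 2 == 0)

# Nothing has been read from `data` yet: generator chains are lazy, and
# records flow through one at a time only when the final step pulls them.
result = [l.rstrip('|\n') for l in every_second_marker]
print(result)  # ['#three']
```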
