itertools, functools, file enhancement ideas

Alex Martelli aleax at mac.com
Sat Apr 7 20:00:42 EDT 2007


Paul Rubin <http://phr.cx@NOSPAM.invalid> wrote:

> I just had to write some programs that crunched a lot of large files,
> both text and binary.  As I use iterators more I find myself wishing
> for some maybe-obvious enhancements:
> 
> 1. File iterator for blocks of chars:
> 
>        f = open('foo')
>        for block in f.iterchars(n=1024):  ...
> 
> iterates through 1024-character blocks from the file.  The default iterator
> which loops through lines is not always a good choice since each line can
> use an unbounded amount of memory.  Default n in the above should be 1 char.

the simple way (letting the file object deal w/buffering issues):

def iterchars(f, n=1):
    while True:
        x = f.read(n)
        if not x: break
        yield x

the fancy way (doing your own buffering) is left as an exercise for the
reader.  I do agree it would be nice to have in some module.
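One possible shape for that "fancy" version, a sketch only (the name, signature, and default buffer size here are made up): read the file in large chunks and slice those into n-sized pieces, so you issue far fewer read() calls than the simple version when n is small.

```python
def iterchars_buffered(f, n=1, bufsize=65536):
    # Hypothetical buffered variant: read big chunks, carve them
    # into n-sized pieces, and carry any remainder into the next chunk.
    leftover = ''
    while True:
        data = f.read(bufsize)
        if not data:
            break
        data = leftover + data
        # largest multiple of n we can slice cleanly
        stop = len(data) - len(data) % n
        for i in range(0, stop, n):
            yield data[i:i + n]
        leftover = data[stop:]
    if leftover:
        # a short final piece, just like the simple version produces
        yield leftover
```

Whether the extra bookkeeping beats the file object's own buffering is worth measuring before bothering.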


> 2. wrapped file openers:
>     There should be functions (either in itertools, builtins, the sys
>     module, or whereever) that open a file, expose one of the above
>     iterators, then close the file, i.e.
>        def file_lines(filename):
>          with open(filename) as f:
>            for line in f:
>              yield line
>     so you can say
> 
>        for line in file_lines(filename):  
>            crunch(line)
> 
> The current bogus idiom is to say "for line in open(filename)" but 
> that does not promise to close the file once the file is exhausted
> (part of the motivation of the new "with" statement).  There should
> similarly be "file_chars" which uses the n-chars iterator instead of
> the line iterator.

I'm +/-0 on this one vs the idioms:

with open(filename) as f:
    for line in f: crunch(line)

with open(filename, 'rb') as f:
    for block in iterchars(f): crunch(block)

Making two lines into one is a weak use case for a stdlib function.
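For completeness, the "file_chars" wrapper Paul mentions would presumably just combine the two idioms above in a generator (a sketch, with an assumed name and default block size):

```python
def file_chars(filename, n=1024):
    # Open the file, yield fixed-size blocks, and close it when the
    # generator is exhausted (or garbage-collected / .close()d).
    with open(filename, 'rb') as f:
        while True:
            block = f.read(n)
            if not block:
                break
            yield block
```

Note the `with` only guarantees closure if the generator is run to exhaustion or explicitly closed, which is part of why the stdlib case is weak.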


> 3. itertools.ichain:
>    yields the contents of each of a sequence of iterators, i.e.:
>      def ichain(seq):
>          for s in seq:
>              for t in s:
>                 yield t
>    this is different from itertools.chain because it lazy-evaluates its
>    input sequence.  Example application:
> 
>       all_filenames = ['file1', 'file2', 'file3']
>       # loop through all the files crunching all lines in each one
>       for line in (ichain(file_lines(x) for x in all_filenames)):
>          crunch(line)

Yes, subtle but important distinction.
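To spell the distinction out: `chain(*seq)` must unpack, and therefore exhaust, the outer sequence before yielding anything, while `ichain` pulls from it on demand. A small demonstration (the `tracked` helper is made up; it stands in for something with a side effect at call time, like `open`):

```python
from itertools import chain

def ichain(seq):
    # iterate the outer sequence lazily, yielding from each inner iterable
    for s in seq:
        for t in s:
            yield t

opened = []

def tracked(name):
    # record that we were called, like open() touching the filesystem
    opened.append(name)
    return iter([name])

# chain(*...) unpacks, and therefore exhausts, the outer generator
# before a single item is consumed:
eager = chain(*(tracked(n) for n in 'abc'))
# opened is now ['a', 'b', 'c']

opened[:] = []
lazy = ichain(tracked(n) for n in 'abc')
next(lazy)
# opened is now just ['a'] -- only the first inner iterator was created
```

With files, that difference is the difference between opening all of them up front and opening each one only as you reach it.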


> 4. functools enhancements (Haskell-inspired):
>    Let f be a function with 2 inputs.  Then:
>       a) def flip(f): return lambda x,y: f(y,x)
>       b) def lsect(x,f): return partial(f,x)
>       c) def rsect(f,x): return partial(flip(f), x)
> 
>    lsect and rsect allow making what Haskell calls "sections".  Example:
>       # sequence of all squares less than 100
>       from operator import lt
>       s100 = takewhile(rsect(lt, 100), (x*x for x in count()))

Looks like they'd be useful, but I'm not sure about limiting them to
working with 2-argument functions only.
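Put together, Paul's three definitions and his example run as-is (modulo the imports, which I've filled in):

```python
from functools import partial
from itertools import count, takewhile
from operator import lt

def flip(f):
    # swap the two arguments of a binary function
    return lambda x, y: f(y, x)

def lsect(x, f):
    # left section: fix the first argument, e.g. lsect(10, lt) is 10 < ?
    return partial(f, x)

def rsect(f, x):
    # right section: fix the second argument, e.g. rsect(lt, 100) is ? < 100
    return partial(flip(f), x)

# sequence of all squares less than 100
s100 = list(takewhile(rsect(lt, 100), (x * x for x in count())))
# s100 == [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
```

A more general flip, not limited to two arguments, would need a convention for which arguments swap; reversing them all (`lambda *a: f(*reversed(a))`) is one choice, though it changes the meaning for arity > 2.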


Alex


