Into itertools

Sun Apr 26 12:32:32 EDT 2009

Some idioms are so common that I think they deserve to be written in C
into the itertools module.

1) leniter(iterator)

It returns the length of a given iterator, consuming it, without
creating a list. I have discussed this twice in the past.
Like itertools.izip_longest don't use it with infinite iterators.

The following code is faster than
sum(1 for _ in iterator)

A Python implementation:

def leniter(iterator):
    if hasattr(iterator, "__len__"):
        return len(iterator)
    nelements = 0
    for _ in iterator:
        nelements += 1
    return nelements

I use it to count the grup items that groupby() gives:

[(h, leniter(g)) for h,g in groupby(iterable)]

And I use it various other situations, like when I want to test the
speed of a lazy generator:

timeit(leniter, xpairwise(xrange(20)))

(Where "timeit" is a function of mine similar to a Mathematica
function that given a function and some data, applies it and returns
the result and the time taken to perform it).

----------------

2) xpairwise(iterable)

This is at the the bottom of the itertools docs:

def pairwise(iterable):
    a, b = tee(iterable)
    for elem in b:
        break
    return izip(a, b)

The following of mine is faster:

def xpairwise(iterable):
    return izip(iterable, islice(iterable, 1, None))

But a C implementation is better (also avoiding copy&paste from the
bottom of the docs page).

This function can be generalized a lot (I also have written a much
more flexible version), but in practice I have seen this basic version
covers most of the common situations.

----------------

3) xpairs(seq)

This is a Python implementation:

def xpairs(seq):
    """xpairs(seq): yields all the distinct pairs of the given seq,
inside a pair
    the order doesn't matter. No pairs have the same element twice.
    This means it yields just the upper triangular half of the pair
matrix.

    >>> list( xpairs([]) )
    []
    >>> list(xpairs([1,2]))
    [(1, 2)]
    >>> list( xpairs("abc") )
    [('a', 'b'), ('a', 'c'), ('b', 'c')]
    >>> list(xpairs(range(5)))
    [(0, 1), (0, 2), (0, 3), (0, 4), (1, 2), (1, 3), (1, 4), (2, 3),
(2, 4), (3, 4)]
    """
    len_seq = len(seq)
    for i, e1 in enumerate(seq):
        for j in xrange(i+1, len_seq):
            yield e1, seq[j]

A more general implementation can also be written when seq isn't
randomly-addressable, but I think that may become not efficient.

IHMO xpairs and xpairwise may even deserve an interpreter support, to
speed up them when possible. In a modern compiled language (like
Chapel, http://chapel.cs.washington.edu/ ) such idioms (in theory)
produce no slowdown compared to normal loops. I have written generic
xpairs() and xpairwise() for the D language too, but the compiler
isn't able to fully optimize away such constructs.

----------------

4) xsubsets(seq)

This is the recipe at the bottom of the docs page (a bit modified, now
it returns lists because in most situations I don't want the results
to be sets):

def subsets(iterable):
    # Recipe credited to Eric Raymond
    pairs = [(2 ** i, x) for i, x in enumerate(iterable)]
    for n in xrange(2 ** len(pairs)):
        yield [x for m, x in pairs if m & n]

After starting to use Python 2.6 I now use the following version (this
replaces a longer version I used in the past):

def xsubsets(seq):
    seq = list(seq)
    yield []
    for nitems in xrange(1, len(seq)+1):
        for subset in combinations(seq, nitems):
            yield subset

Thanks to the very fast combinations() this xsubsets code is faster
than that recipe by Eric Raymond, but it has the disadvantage that it
doesn't yield subsets in the natural "binary" order. So a C
implementation can be done to respect that natural order.

Such "binary" order has a disadvantage, because the lists/tuples it
yields have a different length, so the C code may have troubles in
using the the nice trick it currently use, re-using the same memory
and avoiding a new memory allocation every loop:

/* Copy the previous result tuple or re-use it if available */
if (Py_REFCNT(result) > 1) {
	PyObject *old_result = result;
	result = PyTuple_New(r);
	if (result == NULL)
		goto empty;
	co->result = result;
	for (i=0; i<r ; i++) {
		elem = PyTuple_GET_ITEM(old_result, i);
			Py_INCREF(elem);
			PyTuple_SET_ITEM(result, i, elem);
	}
	Py_DECREF(old_result);
}
/* Now, we've got the only copy so we can update it in-place
 * CPython's empty tuple is a singleton and cached in
 * PyTuple's freelist.
 */
assert(r == 0 || Py_REFCNT(result) == 1);

Maybe there are ways to avoid that in the C implementation of xsubsets
too, I don't know Python implementation enough to tell...

Bye,
bearophile