N-grams

Ian Kelly ian.g.kelly at gmail.com
Wed Nov 9 11:43:01 EST 2016


On Wed, Nov 9, 2016 at 6:38 AM, Steve D'Aprano
<steve+python at pearwood.info> wrote:
> And here's an implementation for arbitrary n-grams:
>
>
> def ngrams(iterable, n=2):
>     if n < 1:
>         raise ValueError
>     t = tee(iterable, n)
>     for i, x in enumerate(t):
>         for j in range(i):
>             next(x, None)
>     return zip(*t)
>
>
> Can we do better, or is that optimal (for any definition of optimal that you
> like)?

The tee object uses O(n) space, which I believe is unavoidable.
Time-wise it's O(n^2 + nm) = O(nm) where m is the size of the
iterable. O(nm) is also the total size of the output, so that's also
unavoidable with the assumption that the caller is going to consume
the entire output.

If that assumption doesn't hold, then perhaps this could be improved
by returning an iterator of iterators rather than of tuples (with the
proviso that the contents of the sub-iterator become unavailable once
the main iterator has been advanced; itertools.groupby works
similarly).



More information about the Python-list mailing list