N-grams

Thu Nov 10 00:50:40 EST 2016

On Wed, Nov 9, 2016 at 7:06 PM, Paul Rubin <no.email at nospam.invalid> wrote:
> This can probably be cleaned up some:

Okay. :-)

>     from itertools import islice
>     from collections import deque
>
>     def ngram(n, seq):

Looks like "seq" can be any iterable, not just a sequence.

>         it = iter(seq)
>         d = deque(islice(it, n))

I'd use the maxlen argument to deque here.
https://docs.python.org/3/library/collections.html#collections.deque
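
A minimal sketch of what I mean, assuming the rest of the function
stays as posted (the values of n and the input here are just for
illustration): with maxlen=n the deque discards its oldest item on
every append, so the explicit popleft() is no longer needed.

```python
from collections import deque
from itertools import islice

n = 3
it = iter(range(6))

# Fill the window with the first n items and cap its size at n.
d = deque(islice(it, n), maxlen=n)

# Appending to a full bounded deque drops the item at the other end.
d.append(next(it))
print(d)  # deque([1, 2, 3], maxlen=3)
```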

>         if len(d) != n:
>             return
>         for s in it:
>             yield tuple(d)
>             d.popleft()
>             d.append(s)

The ordering here means that at any given time, one more element will
have been consumed from the source iterator than has been yielded.
That's slightly inefficient, and it could also cause confusion if seq
were an iterator to begin with, since the caller would find it one
element further along than expected. Better to move the extra yield
above the loop and reorder the loop body so that the yielded tuple
includes the element just read.
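
Roughly like this, as an untested-in-anger sketch rather than a
definitive version. It also uses maxlen, and I've renamed the
parameter to "iterable" (my choice, not part of the original post).

```python
from collections import deque
from itertools import islice

def ngram(n, iterable):
    it = iter(iterable)
    # Bounded window: appending to a full deque drops the oldest item.
    d = deque(islice(it, n), maxlen=n)
    if len(d) != n:
        return  # fewer than n items, so there are no n-grams
    # Yield the first window before reading anything further.
    yield tuple(d)
    for s in it:
        d.append(s)
        # The yielded tuple includes the element just read.
        yield tuple(d)
```

With this ordering, the source iterator has given up exactly the
elements that have appeared in a yielded tuple and no more, so after
the first n-gram from ngram(3, it) the next item left in "it" is the
fourth one.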

>         if len(d) == n:

As written, I don't see how this would ever be false. Each pass
through the loop does one popleft and one append, so the length of d
should still be the same as in the previous check.

>             yield tuple(d)
>
>     def test():
>         xs = range(20)
>         for a in ngram(5, xs):
>             print a
>
>     test()

What makes this better than the tee version?
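
For comparison, here is one common shape for a tee-based n-gram
generator. The version from earlier in the thread isn't quoted in this
message, so treat this as an assumption about its general approach and
not as a copy of it; the name ngram_tee is hypothetical.

```python
from itertools import islice, tee

def ngram_tee(n, iterable):
    # n independent copies of the input.
    iters = tee(iterable, n)
    for i, it in enumerate(iters):
        # Advance the i-th copy by i items (the "consume" idiom from
        # the itertools recipes).
        next(islice(it, i, i), None)
    # zip stops as soon as the most-advanced copy runs out.
    return zip(*iters)
```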