N-grams

Wolfram Hinderer wolfram.hinderer at web.de
Thu Nov 10 01:53:40 EST 2016


Am 10.11.2016 um 03:06 schrieb Paul Rubin:
> This can probably be cleaned up some:
>
>      from itertools import islice
>      from collections import deque
>
>      def ngram(n, seq):
>          it = iter(seq)
>          d = deque(islice(it, n))
>          if len(d) != n:
>              return
>          for s in it:
>              yield tuple(d)
>              d.popleft()
>              d.append(s)
>          if len(d) == n:
>              yield tuple(d)
>
>      def test():
>          xs = range(20)
>          for a in ngram(5, xs):
>              print a
>
>      test()
That's along the lines of what I thought. Steven's version has two areas 
that might be possible to be improved:

1. The startup looks slightly ugly to me.
2. If n is large, tee has to maintain a lot of unnecessary state.

collections.deque manages all the state we need. Here are Steve's 
version (ngram) and mine (ngram_2):

from itertools import tee, islice
from collections import deque

def ngrams(iterable, n=2):
     if n < 1:
         raise ValueError
     t = tee(iterable, n)
     for i, x in enumerate(t):
         for j in range(i):
             next(x, None)
     return zip(*t)

def ngrams_2(iterable, n=2):
     if n < 1:
         raise ValueError
     it = iter(iterable)
     d = deque(islice(it, n-1), maxlen=n)
     for elem in it:
         d.append(elem)
         yield tuple(d)

print(list(ngrams(range(1000), 4)) == list(ngrams_2("abcdefg", 4)))


One problem my version has, is that it does the iteration over the 
iterable itself, so that's probably more python code (instead of C code 
in Steven's version). For large n the reduced number of iterators does 
pay off, though:

%timeit list(ngrams(range(1000), n=500))
10 loops, best of 3: 26.5 ms per loop

%timeit list(ngrams_2(range(1000), n=500))
100 loops, best of 3: 4.07 ms per loop

For small n, it's slower, as expected:

%timeit list(ngrams(range(1000), n=3))
10000 loops, best of 3: 120 µs per loop

%timeit list(ngrams_2(range(1000), n=3))
1000 loops, best of 3: 603 µs per loop





More information about the Python-list mailing list