N-grams

Steve D'Aprano steve+python at pearwood.info
Wed Nov 9 08:38:05 EST 2016


The itertools documentation has this nice implementation of a fast
bigram function:

from itertools import tee

def pairwise(iterable):
    "s -> (s0,s1), (s1,s2), (s2, s3), ..."
    a, b = tee(iterable)
    next(b, None)
    return zip(a, b)


https://docs.python.org/3/library/itertools.html#itertools-recipes


That suggests an obvious trigram and 4-gram implementation:

def trigram(iterable):
    a, b, c = tee(iterable, 3)
    next(b, None)
    next(c, None); next(c, None)
    return zip(a, b, c)

def four_gram(iterable):
    a, b, c, d = tee(iterable, 4)
    next(b, None)
    next(c, None); next(c, None)
    next(d, None); next(d, None); next(d, None)
    return zip(a, b, c, d)
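
To make the staggering concrete, here is a small check of what trigram
produces. The definition is repeated so the snippet runs on its own, and
the comments are added for illustration. An input shorter than three items
yields nothing, since zip stops at the shortest iterator.

```python
from itertools import tee

def trigram(iterable):
    a, b, c = tee(iterable, 3)
    next(b, None)                  # b starts one item ahead of a
    next(c, None); next(c, None)   # c starts two items ahead of a
    return zip(a, b, c)

print(list(trigram("abcde")))
# [('a', 'b', 'c'), ('b', 'c', 'd'), ('c', 'd', 'e')]
print(list(trigram("ab")))
# []
```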


And here's an implementation for arbitrary n-grams:


def ngrams(iterable, n=2):
    if n < 1:
        raise ValueError("n must be at least 1")
    t = tee(iterable, n)
    # Advance the i-th copy by i items so the copies are staggered.
    for i, x in enumerate(t):
        for j in range(i):
            next(x, None)
    return zip(*t)
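
One possible variation, offered as a sketch rather than a claim that it is
any faster: the nested loop that advances each copy can be replaced by
itertools.islice, which skips the leading items lazily instead of up front.
The name ngrams_islice is made up for this example.

```python
from itertools import islice, tee

def ngrams_islice(iterable, n=2):
    # Hypothetical name; intended to yield the same tuples as ngrams.
    if n < 1:
        raise ValueError("n must be at least 1")
    copies = tee(iterable, n)
    # islice(it, i, None) drops the first i items of the i-th copy,
    # but only once zip starts pulling values from it.
    return zip(*(islice(it, i, None) for i, it in enumerate(copies)))

print(list(ngrams_islice("abcde", 3)))
# [('a', 'b', 'c'), ('b', 'c', 'd'), ('c', 'd', 'e')]
```

The visible difference is when the skipping happens: the loop version
consumes items from the source as soon as the function is called, while
this version touches the source only when the result is iterated.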


Can we do better, or is that optimal (for any definition of optimal that you
like)?




-- 
Steve
“Cheer up,” they said, “things could be worse.” So I cheered up, and sure
enough, things got worse.



