N-grams

Steven D'Aprano steve+comp.lang.python at pearwood.info
Thu Nov 10 03:01:15 EST 2016


On Thursday 10 November 2016 17:53, Wolfram Hinderer wrote:

[...]
> 1. The startup looks slightly ugly to me.
> 2. If n is large, tee has to maintain a lot of unnecessary state.

But n should never be large.

In practice, n-grams are rarely larger than n=3. Occasionally you might use n=4 
or even n=5, but I can't imagine using n=20 in practice, let alone the n=500 
from your example.

See, for example: 

http://stackoverflow.com/a/10382221
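
For what it's worth, the tee-state objection is easy to sidestep anyway: a 
deque-based sliding window keeps exactly n items of state no matter how large 
n gets. A rough sketch (my own code, not anything posted in this thread):

from collections import deque
from itertools import islice

def ngrams(iterable, n):
    # Slide a window of n items over the input; the deque holds only
    # the current window, unlike tee's n parallel iterators.
    it = iter(iterable)
    window = deque(islice(it, n), maxlen=n)
    if len(window) == n:
        yield tuple(window)
    for item in it:
        window.append(item)  # the oldest item falls off automatically
        yield tuple(window)

So list(ngrams("abcde", 3)) gives 
[('a', 'b', 'c'), ('b', 'c', 'd'), ('c', 'd', 'e')].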

In practice, n-grams with large n run into three problems:

- for word-based n-grams, n=3 is about the maximum needed;

- for other applications, n can be moderately large, but n-grams are a
  kind of auto-correlation function, and few data sets are auto-correlated
  *that* deeply, so you still rarely need large values of n;

- there is the problem of sparse data and generating a good training corpus.

For n=10, and just using ASCII letters (lowercase only), there are 26**10 = 
141167095653376 possible 10-grams. Where are you going to find a text that 
includes more than a tiny fraction of those?
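
To put numbers on that: even a billion-character corpus (a size I'm picking 
purely for illustration) contains at most one 10-gram per position, so its 
coverage of the space is capped at a vanishingly small fraction:

possible = 26 ** 10              # 141,167,095,653,376 lowercase 10-grams
corpus_chars = 10 ** 9           # hypothetical billion-character corpus
max_distinct = corpus_chars - 9  # at most one 10-gram per position
print(max_distinct / possible)   # ~7.1e-06, i.e. under 0.001% coverage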



-- 
Steven
299792.458 km/s — not just a good idea, it’s the law!



