Categorising strings into random versus non-random

Mon Dec 21 12:20:01 EST 2015

Steven D'Aprano <steve at pearwood.info> writes:
> Does anyone have any suggestions for how to do this? Preferably something
> already existing. I have some thoughts and/or questions:

I think I'd just look at the set of digraphs or trigraphs in each name
and see if there are a lot that aren't found in English.
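
A rough sketch of that idea in Python, in case it helps. Everything here
is an assumption for illustration: the function names, the 0.5 threshold,
and especially the toy word list, which stands in for a proper English
word list (something like /usr/share/dict/words would be more realistic).

```python
# Toy reference sample; a real check would need a much larger English word list.
ENGLISH_SAMPLE = ["baby", "grapes", "letter", "window", "charlie",
                  "hello", "python", "string", "random"]


def ngrams(s, n):
    """Return the set of n-character substrings of s, lowercased."""
    s = s.lower()
    return {s[i:i + n] for i in range(len(s) - n + 1)}


def build_known(words, n=2):
    """Collect every n-gram seen in the reference words."""
    known = set()
    for word in words:
        known |= ngrams(word, n)
    return known


def unfamiliar_fraction(name, known, n=2):
    """Fraction of the name's distinct n-grams that are absent from the reference set."""
    grams = ngrams(name, n)
    if not grams:
        return 0.0  # name shorter than n: nothing to judge
    return len(grams - known) / len(grams)


def looks_random(name, known, n=2, threshold=0.5):
    """Guess 'random' when most n-grams are unfamiliar (threshold is arbitrary)."""
    return unfamiliar_fraction(name, known, n) > threshold


known_digraphs = build_known(ENGLISH_SAMPLE, n=2)
print(looks_random("hello", known_digraphs))
print(looks_random("xqzjk", known_digraphs))
```

Passing n=3 to both build_known and the checks gives the trigraph version.
With a reference list this small, plenty of ordinary names would be
flagged, so the threshold only means something once the reference set is
reasonably large.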

> - I think nltk has a "language detection" function, would that be suitable?
> - If not nltk, are there any suitable language detection libraries?

I suspect these need longer strings to work.

> - Is this the sort of problem that neural networks are good at solving?
> Anyone know a really good tutorial for neural networks in Python?
> - How about Bayesian filters, e.g. SpamBayes?

You want large training sets for these approaches.