Language detection module?
Fernando Pereira
pereira at research.att.com
Thu Oct 21 23:50:30 EDT 1999
In article <slrn80v4cb.49r.thantos at chancel.org>, Alexander Williams
<thantos at chancel.org> wrote:
> On Thu, 21 Oct 1999 15:53:43 +0200, Dinu C. Gherman
> <gherman at darwin.in-berlin.de> wrote:
>
> >is there anything already like a function that I can pass an
> >arbitrary string and it will tell me whether it is written in
> >English, French, German, etc.?
>
> The method you use below is part of what I call the 'fast, dumb and
> happy' method of algorithms. :) (No shame there, I use it all the
> time.) It takes careful crafting, but it's algorithmically simple. If
> you want something a bit more robust ...
>
> Take a 2 - 6 character sliding window, then snip the file into bits.
> Don't worry about capitalization or punctuation, take the document
> raw. Extract these Ngrams and create a sort of vector of them, each
> Ngram valued with its occurrences. Repeat for a large corpus of
> documents in different languages. Now, begin clustering the documents
> based on the nearness of other documents in Ngrammatic space. You'll
> find all the documents of a given language tend to hang together (not
> the least reason for which is that they tend to use the same phrases
> other languages don't). As a side effect, you'll likely cluster
> documents about similar subjects together, but don't mind that right
> now. :)
The standard way of doing this doesn't involve clustering (which is
hard), provided that you have training samples for each target language.
From the training sample for language L, one builds a "language model"
M[L] that estimates the probability M[L](S) of any string S according
to the language. Given a test string T, guess its language to be the L
such that M[L](T) is highest. One can build language models in many
ways, but most of the simple ones involve n-gram statistics. There are
certain subtleties in how to deal with test strings containing n-grams
not found in the training sample for some language. For details, consult
@book{Bell+Cleary+Witten:90,
author = {Timothy C. Bell and John G. Cleary and Ian H. Witten},
title = {Text Compression},
publisher = {Prentice Hall},
address = {Englewood Cliffs, New Jersey},
year = 1990,
}
@book{Manning+Schuetze:99,
author = {Christopher D. Manning and Hinrich Sch\"{u}tze},
title = {Foundations of Statistical Natural Language Processing},
publisher = {{MIT} Press},
year = 1999,
address = {Cambridge, Massachusetts},
}
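The scheme above can be sketched in a few lines of Python. This is a toy
illustration, not a serious implementation: it uses character trigrams, a
unigram-of-n-grams model, and add-one smoothing to handle n-grams absent
from a training sample (the books above discuss much better smoothing
methods), and the training strings are made-up examples far too small for
real use.

```python
import math
from collections import Counter

def ngrams(text, n=3):
    """Character n-grams over the raw string (no tokenization)."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def train(samples):
    """Build one n-gram count model per language from training text."""
    return {lang: Counter(ngrams(text)) for lang, text in samples.items()}

def log_prob(model, text):
    """Log of M[L](T) with add-one smoothing, so n-grams unseen in
    training don't zero out the whole product."""
    total = sum(model.values())
    vocab = len(model) + 1
    return sum(math.log((model[g] + 1) / (total + vocab))
               for g in ngrams(text))

def guess_language(models, text):
    """Guess the L whose model M[L] assigns T the highest probability."""
    return max(models, key=lambda lang: log_prob(models[lang], text))

# Toy training samples (hypothetical; real models need large corpora)
samples = {
    "en": "the quick brown fox jumps over the lazy dog and the cat sat",
    "de": "der schnelle braune fuchs springt ueber den faulen hund und die katze",
}
models = train(samples)
print(guess_language(models, "the dog and the cat"))
```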
-- F