Language detection module?

Fernando Pereira pereira at research.att.com
Thu Oct 21 23:50:30 EDT 1999


In article <slrn80v4cb.49r.thantos at chancel.org>, Alexander Williams
<thantos at chancel.org> wrote:

> On Thu, 21 Oct 1999 15:53:43 +0200, Dinu C. Gherman
> <gherman at darwin.in-berlin.de> wrote:
> 
> >is there anything already like a function that I can pass an
> >arbitrary string and it will tell me whether it is written in
> >English, French, German, etc.? 
> 
> The method you use below is part of what I call the 'fast, dumb and
> happy' method of algorithms.  :)  (No shame there, I use it all the
> time.)  It takes careful crafting, but it's algorithmically simple.  If
> you want something a bit more robust ...
> 
> Take a 2 - 6 character sliding window, then snip the file into bits.
> Don't worry about capitalization or punctuation, take the document
> raw.  Extract these Ngrams and create a sort of vector of them, each
> Ngram valued with its occurrences.  Repeat for a large corpus of
> documents in different languages.  Now, begin clustering the documents
> based on the nearness of other documents in Ngrammatic space.  You'll
> find all the documents of a given language tend to hang together (not
> the least reason for which is that they tend to use the same phrases
> other languages don't).  As a side effect, you'll likely cluster
> documents about similar subjects together, but don't mind that right
> now.  :)
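
(As an aside, here is a rough Python sketch of the n-gram-vector part of
the quoted idea; trigrams and cosine similarity as the "nearness" measure
are my own choices, and the clustering step itself is left out:)

from collections import Counter
from math import sqrt

def ngram_counts(text, n=3):
    """Character n-gram counts over the raw text (no case folding)."""
    return Counter(text[i:i+n] for i in range(len(text) - n + 1))

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    nu = sqrt(sum(c * c for c in u.values()))
    nv = sqrt(sum(c * c for c in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0
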
The standard way of doing this doesn't involve clustering (which is
hard), provided that you have a training sample for each target
language. From the training sample for language L, one builds a
"language model" M[L] that estimates the probability M[L](S) of any
string S according to that language. Given a test string T, guess its
language to be the L such that M[L](T) is highest. One can build
language models in many ways, but most of the simple ones involve
n-gram statistics. There are certain subtleties in how to deal with
test strings containing n-grams not found in the training sample for
some language; a toy sketch follows the references below. For details,
consult

@book{Bell+Cleary+Witten:90,
   author =    {Timothy C. Bell and John G. Cleary and Ian H. Witten},
   title =     {Text Compression},
   publisher = {Prentice Hall},
   address = {Englewood Cliffs, New Jersey},
   year =      1990,
}

@book{Manning+Schuetze:99,
   author =    {Christopher D. Manning and Hinrich Sch\"{u}tze},
   title =     {Foundations of Statistical Natural Language Processing},
   publisher = {{MIT} Press},
   year =      1999,
   address =   {Cambridge, Massachusetts},
}
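
As a toy illustration of the recipe above (not taken from either book):
character bigram models with add-one smoothing for the unseen-n-gram
problem, and the language guessed by maximum log-probability. The
constants, names, and file names here are my own placeholder choices.

from collections import Counter
from math import log

def train(sample, n=2):
    """Character n-gram counts from a training sample for one language."""
    return Counter(sample[i:i+n] for i in range(len(sample) - n + 1))

def log_prob(model, text, n=2, alpha=1.0):
    """Smoothed log-probability of `text` under `model`; add-alpha
    smoothing handles n-grams never seen in the training sample."""
    total = sum(model.values())
    vocab = len(model) + 1   # crude estimate of the n-gram vocabulary size
    lp = 0.0
    for i in range(len(text) - n + 1):
        count = model.get(text[i:i+n], 0)
        lp += log((count + alpha) / (total + alpha * vocab))
    return lp

def guess_language(models, text):
    """Return the L whose model M[L] gives the test string the highest
    (smoothed) probability."""
    return max(models, key=lambda L: log_prob(models[L], text))

# usage (training files are placeholders):
# models = {'en': train(open('english.txt').read()),
#           'fr': train(open('french.txt').read())}
# guess_language(models, 'quelle est la langue de cette phrase ?')  # -> 'fr'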

-- F



