[Python-Dev] Encoding detection in the standard library?

Tue Apr 22 23:34:20 CEST 2008

On 2008-04-22 18:33, Bill Janssen wrote:
> The 2002 paper "A language and character set determination method
> based on N-gram statistics" by Izumi Suzuki and Yoshiki Mikami and
> Ario Ohsato and Yoshihide Chubachi seems to me a pretty good way to go
> about this. 

Thanks for the reference.

Looks like the existing research on this just hasn't made it into the
mainstream yet.

Here's their current project: http://www.language-observatory.org/
Looks like they are focusing more on language detection.

Another interesting paper using n-grams:
"Language Identification in Web Pages" by Bruno Martins and Mário J. Silva
http://xldb.fc.ul.pt/data/Publications_attach/ngram-article.pdf
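
To make the n-gram idea concrete, here is a rough sketch. It is not the
method from either paper: it simply compares byte-trigram counts by
cosine similarity and answers "unable to detect" (None) below a cutoff.
The training sentences, the PROFILES table, the detect() helper and the
0.3 threshold are all assumptions made up for illustration.

```python
from collections import Counter

# Hypothetical training text; a real system would need far more per entry.
RUSSIAN = "В чащах юга жил бы цитрус? Да, но фальшивый экземпляр!"
ENGLISH = "The quick brown fox jumps over the lazy dog near the river bank."


def trigram_profile(data: bytes) -> Counter:
    """Count overlapping byte trigrams in data."""
    return Counter(data[i:i + 3] for i in range(len(data) - 2))


# Registered (language, encoding) pairs mapped to their trigram profiles.
PROFILES = {
    ("ru", "koi8-r"): trigram_profile(RUSSIAN.encode("koi8-r")),
    ("ru", "utf-8"): trigram_profile(RUSSIAN.encode("utf-8")),
    ("en", "ascii"): trigram_profile(ENGLISH.encode("ascii")),
}


def similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two trigram count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = sum(c * c for c in a.values()) ** 0.5
    norm_b = sum(c * c for c in b.values()) ** 0.5
    if not norm_a or not norm_b:
        return 0.0
    return dot / (norm_a * norm_b)


def detect(data: bytes, profiles=PROFILES, threshold=0.3):
    """Return the best (language, encoding) pair, or None if unsure.

    The threshold is an arbitrary assumption; None plays the role of
    the "unable to detect" answer.
    """
    sample = trigram_profile(data)
    best_label, best_score = None, 0.0
    for label, profile in profiles.items():
        score = similarity(sample, profile)
        if score > best_score:
            best_label, best_score = label, score
    return best_label if best_score >= threshold else None


print(detect(RUSSIAN.encode("koi8-r")))
print(detect(bytes(range(6))))
```

Refusing to answer below a cutoff only approximates the "no wrong
answers" goal; a toy like this can still be fooled by short or mixed
input.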

And one using compression:
"Text Categorization Using Compression Models" by
Eibe Frank, Chang Chui, Ian H. Witten
http://portal.acm.org/citation.cfm?id=789742
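
The compression idea can be tried with the stdlib as well. This is only
a sketch: zlib stands in for whatever compression models the paper uses,
and the tiny CORPORA table and the classify() helper are invented for
the example. The category whose corpus makes the sample cheapest to
compress wins.

```python
import zlib

# Hypothetical per-category training text; real corpora would be much larger.
CORPORA = {
    "en": b"the quick brown fox jumps over the lazy dog and the cat "
          b"sat on the mat while the rain fell on the town ",
    "de": b"der schnelle braune fuchs springt ueber den faulen hund und "
          b"die katze sass auf der matte waehrend der regen fiel ",
}


def extra_bytes(corpus: bytes, sample: bytes) -> int:
    """Compressed size added by appending sample to corpus."""
    with_sample = len(zlib.compress(corpus + sample, 9))
    without_sample = len(zlib.compress(corpus, 9))
    return with_sample - without_sample


def classify(sample: bytes, corpora=CORPORA) -> str:
    """Pick the category whose corpus compresses the sample best."""
    return min(corpora, key=lambda label: extra_bytes(corpora[label], sample))


print(classify(b"the quick brown fox jumps over the lazy dog"))
```

Note that this variant always returns some category, so it would need
an extra rejection step before it could say "unable to detect".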

> They're looking at "LSE"s, language-script-encoding
> triples; a "script" is a way of using a particular character set to
> write in a particular language.
> 
> Their system has these requirements:
> 
> R1. the response must be either "correct answer" or "unable to detect"
>     where "unable to detect" includes "other than registered" [the
>     registered set of LSEs];
> 
> R2. Applicable to multi-LSE texts;
> 
> R3. never accept a wrong answer, even when the program does not have
>     enough data on an LSE; and
> 
> R4. applicable to any LSE text.
> 
> So, no wrong answers.
> 
> The biggest disadvantage would seem to be that the registration data
> for a particular LSE is kind of bulky; on the order of 10,000
> shift-codons, each of three bytes, about 30K uncompressed.
> 
> http://portal.acm.org/ft_gateway.cfm?id=772759&type=pdf

For a server-based application, 30K per LSE doesn't sound too large.

Unless you're using a very broad scope, I don't think that
you'd need more than a few hundred LSEs for a typical
application - nothing you'd want to put in the Python stdlib,
though.
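
A back-of-the-envelope check of the sizes, using the figures quoted
above (10,000 shift-codons of three bytes each per LSE). The count of
300 LSEs is my own assumption standing in for "a few hundred".

```python
CODONS_PER_LSE = 10_000  # from the quoted estimate
BYTES_PER_CODON = 3      # from the quoted estimate
LSE_COUNT = 300          # assumption: "a few hundred" LSEs

bytes_per_lse = CODONS_PER_LSE * BYTES_PER_CODON
total_bytes = bytes_per_lse * LSE_COUNT

print(f"per LSE: {bytes_per_lse:,} bytes")
print(f"{LSE_COUNT} LSEs: {total_bytes:,} bytes "
      f"({total_bytes / 2**20:.1f} MiB uncompressed)")
```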

> Bill
> 
>>> IMHO, more research has to be done into this area before a
>>> "standard" module can be added to Python's stdlib... and
>>> who knows, perhaps we're lucky and by then everyone is
>>> using UTF-8 anyway :-)
>> I walked over to our computational linguistics group and asked.  This
>> is often combined with language guessing (which uses a similar
>> approach, but using characters instead of bytes), and apparently can
>> usually be done with high confidence.  Of course, they're usually
>> looking at clean texts, not random "stuff".  I'll see if I can get
>> some references and report back -- most of the research on this was
>> done in the 90's.
>>
>> Bill

-- 
Marc-Andre Lemburg
eGenix.com
