[Python-Dev] Encoding detection in the standard library?
M.-A. Lemburg
mal at egenix.com
Tue Apr 22 23:34:20 CEST 2008
On 2008-04-22 18:33, Bill Janssen wrote:
> The 2002 paper "A language and character set determination method
> based on N-gram statistics" by Izumi Suzuki and Yoshiki Mikami and
> Ario Ohsato and Yoshihide Chubachi seems to me a pretty good way to go
> about this.
Thanks for the reference.
Looks like the existing research on this just hasn't made it into the
mainstream yet.
Here's their current project: http://www.language-observatory.org/
Looks like they are focusing more on language detection.
Another interesting paper using n-grams:
"Language Identification in Web Pages" by Bruno Martins and Mário J. Silva
http://xldb.fc.ul.pt/data/Publications_attach/ngram-article.pdf
And one using compression:
"Text Categorization Using Compression Models" by
Eibe Frank, Chang Chui, Ian H. Witten
http://portal.acm.org/citation.cfm?id=789742
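To give a rough feel for the compression idea, here is a minimal sketch. It is not the method from the paper above: it swaps in zlib from the standard library as the compressor, and the function names and the two tiny corpora are made up for illustration. The assumption is that a sample adds fewer compressed bytes when appended to a corpus of the same category, because the compressor can reuse strings it has already seen.

```python
import zlib
from typing import Mapping


def compressed_size(data: bytes) -> int:
    """Size in bytes of the zlib-compressed data (maximum compression level)."""
    return len(zlib.compress(data, 9))


def classify(sample: bytes, corpora: Mapping[str, bytes]) -> str:
    """Return the label whose corpus grows least when the sample is appended."""

    def extra_bytes(corpus: bytes) -> int:
        # Cost of encoding the sample given that the corpus was already seen.
        return compressed_size(corpus + b" " + sample) - compressed_size(corpus)

    return min(corpora, key=lambda label: extra_bytes(corpora[label]))


# Hypothetical toy corpora; real use would need far more training text.
CORPORA = {
    "en": b"the quick brown fox jumps over the lazy dog "
          b"while the cat sleeps on the warm mat near the door",
    "de": b"der schnelle braune fuchs springt ueber den faulen hund "
          b"waehrend die katze auf der warmen matte schlaeft",
}

if __name__ == "__main__":
    print(classify(b"the cat sleeps on the warm mat", CORPORA))
    print(classify(b"die katze auf der warmen matte", CORPORA))
```

With corpora this small the result is only dependable for samples that closely repeat the training text, so treat it as an illustration of the principle rather than a usable classifier.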
> They're looking at "LSE"s, language-script-encoding
> triples; a "script" is a way of using a particular character set to
> write in a particular language.
>
> Their system has these requirements:
>
> R1. the response must be either "correct answer" or "unable to detect"
> where "unable to detect" includes "other than registered" [the
> registered set of LSEs];
>
> R2. Applicable to multi-LSE texts;
>
> R3. never accept a wrong answer, even when the program does not have
> enough data on an LSE; and
>
> R4. applicable to any LSE text.
>
> So, no wrong answers.
>
> The biggest disadvantage would seem to be that the registration data
> for a particular LSE is kind of bulky; on the order of 10,000
> shift-codons, each of three bytes, about 30K uncompressed.
>
> http://portal.acm.org/ft_gateway.cfm?id=772759&type=pdf
For a server-based application that doesn't sound too large.
Unless you're using a very broad scope, I don't think that
you'd need more than a few hundred LSEs for a typical
application - nothing you'd want to put in the Python stdlib,
though.
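To make the size estimate and the "correct answer or unable to detect" requirement a bit more concrete, here is a rough sketch. It uses plain byte trigrams, not the shift-codons from the paper, and every name, threshold and the sample registry is an assumption made for illustration. A threshold like this cannot guarantee R3 (no wrong answers); it only shows the shape of a detector that is allowed to decline. The figure of 300 LSEs in the usage lines is just one reading of "a few hundred".

```python
from typing import Dict, FrozenSet, List, Optional, Tuple

# An LSE is a (language, script, encoding) triple, as in the quoted summary.
LSE = Tuple[str, str, str]


def trigrams(data: bytes) -> List[bytes]:
    """Return every overlapping 3-byte window of data."""
    return [data[i:i + 3] for i in range(len(data) - 2)]


def build_profile(training: bytes) -> FrozenSet[bytes]:
    """Registration data for one LSE: the set of trigrams in the training text."""
    return frozenset(trigrams(training))


def coverage(sample: bytes, profile: FrozenSet[bytes]) -> float:
    """Fraction of the sample's trigrams that occur in the profile."""
    grams = trigrams(sample)
    if not grams:
        return 0.0
    return sum(g in profile for g in grams) / len(grams)


def detect(sample: bytes, registry: Dict[LSE, FrozenSet[bytes]],
           threshold: float = 0.9, margin: float = 0.2) -> Optional[LSE]:
    """Return the best-matching LSE, or None for "unable to detect"."""
    ranked = sorted(((coverage(sample, profile), lse)
                     for lse, profile in registry.items()), reverse=True)
    if not ranked:
        return None
    best_score, best_lse = ranked[0]
    runner_up = ranked[1][0] if len(ranked) > 1 else 0.0
    # Decline when the match is weak or when two LSEs fit about equally well.
    if best_score < threshold or best_score - runner_up < margin:
        return None
    return best_lse


def registry_size_bytes(n_lses: int, entries_per_lse: int = 10_000,
                        bytes_per_entry: int = 3) -> int:
    """Uncompressed registration data, using the figures quoted above."""
    return n_lses * entries_per_lse * bytes_per_entry


# Hypothetical registry: one short Russian sentence in two encodings.
TRAINING_TEXT = "быстрая коричневая лиса прыгает через ленивую собаку"
REGISTRY = {
    ("ru", "Cyrillic", "KOI8-R"): build_profile(TRAINING_TEXT.encode("koi8_r")),
    ("ru", "Cyrillic", "UTF-8"): build_profile(TRAINING_TEXT.encode("utf-8")),
}

if __name__ == "__main__":
    print(detect("ленивую собаку".encode("koi8_r"), REGISTRY))
    print(detect(b"hello world", REGISTRY))
    print(registry_size_bytes(300))
```

At the quoted 30K per LSE, 300 LSEs come to about 9 MB uncompressed, which is consistent with "not too large" for a server but heavy for a stdlib module.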
> Bill
>
>>> IMHO, more research has to be done into this area before a
>>> "standard" module can be added to Python's stdlib... and
>>> who knows, perhaps we're lucky and by that time everyone is
>>> using UTF-8 anyway :-)
>> I walked over to our computational linguistics group and asked. This
>> is often combined with language guessing (which uses a similar
>> approach, but using characters instead of bytes), and apparently can
>> usually be done with high confidence. Of course, they're usually
>> looking at clean texts, not random "stuff". I'll see if I can get
>> some references and report back -- most of the research on this was
>> done in the 90's.
>>
>> Bill
--
Marc-Andre Lemburg
eGenix.com