[Python-Dev] Encoding detection in the standard library?

Tue Apr 22 22:54:35 CEST 2008

[CCing python-dev again]

On 2008-04-22 12:38, Greg Wilson wrote:
>>> I don't think that should be part of the standard library. People
>>> will mistake what it tells them for certain.
>>> [etc]
> 
> These are all good arguments, but the fact remains that we can't control 
> our inputs (e.g., we're archiving mail messages sent to lists managed by 
> DrProject), and some of those inputs *don't* tell us how they're encoded.
> Under those circumstances, what would you recommend?

I haven't done much research into this, but in general, I think it's
better to:

  * first try to look at other characteristics of a text
    message, e.g. language, origin, topic, etc.,

  * then narrow down the number of encodings which could apply,

  * rank them to try to avoid ambiguities and

  * then try to see what percentage of the text you can decode using
    each of the encodings, testing them from most to least specialized
    (i.e. more specialized encodings should be tested first, latin-1
    last, since latin-1 will decode any byte sequence).
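
A rough sketch of that last step, in case it helps. This is only an
illustration, not a tested detector: the helper names
(decodable_fraction, guess_encoding) and the threshold parameter are
made up for this example, and the candidate list is assumed to have
been narrowed down and ordered by the earlier steps. It scores each
candidate by decoding with errors="replace" and counting how many
U+FFFD replacement characters come back.

```python
def decodable_fraction(data: bytes, encoding: str) -> float:
    """Fraction of decoded characters that are not U+FFFD replacements.

    Assumption: a U+FFFD in the result means a decoding failure. Text
    that legitimately contains U+FFFD will be scored too low.
    """
    if not data:
        return 1.0
    text = data.decode(encoding, errors="replace")
    if not text:
        return 0.0
    return 1.0 - text.count("\ufffd") / len(text)


def guess_encoding(data: bytes, candidates, threshold: float = 1.0):
    """Try candidates in the given order (most specialized first).

    Returns (encoding, score) for the first candidate whose score
    reaches the threshold, otherwise the best-scoring candidate, or
    None if no candidates were given.
    """
    best = None
    for encoding in candidates:
        score = decodable_fraction(data, encoding)
        if score >= threshold:
            return encoding, score
        if best is None or score > best[1]:
            best = (encoding, score)
    return best


# latin-1 goes last: it maps every byte to a character, so it always
# scores 1.0 and would hide every other candidate if tried earlier.
candidates = ["ascii", "utf-8", "latin-1"]
result_utf8 = guess_encoding("Grüße".encode("utf-8"), candidates)
result_latin1 = guess_encoding("Grüße".encode("latin-1"), candidates)
```

Even a score of 1.0 only says that the bytes are valid in that
encoding, not that the encoding is the right one, which is why the
narrowing and ranking steps have to come first.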

-- 
Marc-Andre Lemburg
eGenix.com
