[Python-Dev] Encoding detection in the standard library?
M.-A. Lemburg
mal at egenix.com
Tue Apr 22 22:54:35 CEST 2008
[CCing python-dev again]
On 2008-04-22 12:38, Greg Wilson wrote:
>>> I don't think that should be part of the standard library. People
>>> will mistake what it tells them for certain.
>>> [etc]
>
> These are all good arguments, but the fact remains that we can't control
> our inputs (e.g., we're archiving mail messages sent to lists managed by
> DrProject), and some of those inputs *don't* tell us how they're encoded.
> Under those circumstances, what would you recommend?
I haven't done much research into this, but in general, I think it's
better to:
 * first look at other characteristics of the text message,
   e.g. language, origin, topic, etc.,
 * then narrow down the number of encodings which could apply,
 * rank them to try to avoid ambiguities, and
 * then see what percentage of the text you can decode using
   each of the encodings in reverse ranking order (i.e. more
   specialized encodings should be tested first, latin-1 last).
--
Marc-Andre Lemburg
eGenix.com