Chardet, file, ... and the Flexible String Representation
Steven D'Aprano
steve+comp.lang.python at pearwood.info
Fri Sep 6 06:57:48 EDT 2013
On Fri, 06 Sep 2013 02:11:56 -0700, wxjmfauth wrote:
> Short comment about the "detection" tools from a previous discussion.
>
> The tools supposed to detect the coding scheme are all working with a
> simple logical mathematical rule:
>
> p ==> q <==> non q ==> non p .
Incorrect.
chardet does a statistical analysis of the bytes, and tries to guess what
language, and hence which encoding, they are likely to come from. The
algorithm is described here:
https://github.com/erikrose/chardet/blob/master/docs/how-it-works.html
(although that's rather inconvenient to read), and here:
http://www-archive.mozilla.org/projects/intl/UniversalCharsetDetection.html
chardet is a Python port of the Mozilla charset guesser, so they use the
same algorithm.
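
To give a feel for what "statistical" means here, below is a toy sketch.
It is emphatically not chardet's or Mozilla's algorithm: the scoring rule
(fraction of letters and spaces), the candidate list, and the names
plausibility and guess_encoding are all made up for illustration. It only
shows the general shape of the approach: several encodings may decode the
bytes without error, and a score is used to rank them.

```python
def plausibility(text):
    """Fraction of characters that are letters or spaces (a crude toy metric)."""
    if not text:
        return 0.0
    good = sum(1 for ch in text if ch.isalpha() or ch == " ")
    return good / len(text)


def guess_encoding(data, candidates=("utf-8", "latin-1", "cp1252")):
    """Return (encoding, score) pairs for every candidate that decodes, best first."""
    results = []
    for enc in candidates:
        try:
            text = data.decode(enc)
        except UnicodeDecodeError:
            continue  # this candidate is ruled out entirely
        results.append((enc, plausibility(text)))
    # sorted() is stable, so ties keep the order of the candidates tuple
    return sorted(results, key=lambda pair: pair[1], reverse=True)


# "naive cafe" with accents, encoded as UTF-8
data = "na\u00efve caf\u00e9".encode("utf-8")
ranking = guess_encoding(data)
for enc, score in ranking:
    print(enc, round(score, 2))
```

All three candidates decode these bytes; the wrong ones merely produce
text that looks less like natural language, so they rank lower. The result
is a ranked guess, not a proof.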
> Shortly -- and consequence -- they do not detect a coding scheme they
> only detect "a" possible coding scheme.
That at least is correct.
> The Flexible String Representation has conceptually to face the same
> problem.
No it doesn't.
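
The reason is that a str already holds known code points, so nothing has
to be guessed: the width per character follows from the largest code
point in the string. Here is a rough sketch of that rule as I understand
it from PEP 393. The helper name fsr_width is hypothetical, and this is
not CPython's actual C implementation.

```python
def fsr_width(s):
    """Bytes per character a PEP 393 style representation would pick for s."""
    highest = max(map(ord, s), default=0)  # largest code point, 0 if s is empty
    if highest < 0x100:      # fits in Latin-1: 1 byte per character
        return 1
    if highest < 0x10000:    # fits in the BMP: 2 bytes per character
        return 2
    return 4                 # astral code points: 4 bytes per character


for sample in ("abc", "caf\u00e9", "\u20ac", "\U0001F600"):
    print(ascii(sample), fsr_width(sample))
```

The choice is a deterministic function of the string's contents. Compare
that with a charset detector, which starts from bytes of unknown meaning
and can only rank hypotheses.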
--
Steven