Chardet, file, ... and the Flexible String Representation

Steven D'Aprano steve+comp.lang.python at pearwood.info
Fri Sep 6 06:57:48 EDT 2013


On Fri, 06 Sep 2013 02:11:56 -0700, wxjmfauth wrote:

> Short comment about the "detection" tools from a previous discussion.
> 
> The tools supposed to detect the coding scheme are all working with a
> simple logical mathematical rule:
> 
> p  ==> q    <==>   non q  ==> non p .

Incorrect.

chardet does a statistical analysis of the bytes and tries to guess which 
encoding (and language) they most likely come from. The algorithm is 
described here:

https://github.com/erikrose/chardet/blob/master/docs/how-it-works.html

(although that's rather inconvenient to read), and here:

http://www-archive.mozilla.org/projects/intl/UniversalCharsetDetection.html


chardet is a Python port of the Mozilla charset guesser, so they use the 
same algorithm.
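
To make "statistical analysis" a little more concrete, here is a toy 
sketch of the general idea: decode the bytes under each candidate encoding 
and score how plausible the result looks as text. This is NOT chardet's 
algorithm. The names score() and guess(), the two candidate encodings and 
the "letters or whitespace" heuristic are all assumptions made up for this 
illustration; the real detectors use per-language frequency models.

```python
def score(data: bytes, encoding: str) -> float:
    """Fraction of decoded characters that are letters or whitespace.

    Returns 0.0 if the bytes cannot be decoded at all. This is a
    deliberately crude stand-in for a real frequency model.
    """
    try:
        text = data.decode(encoding)
    except UnicodeDecodeError:
        return 0.0
    if not text:
        return 0.0
    plausible = sum(ch.isalpha() or ch.isspace() for ch in text)
    return plausible / len(text)


def guess(data: bytes, candidates=("utf-8", "latin-1")):
    """Return the best-scoring candidate plus every candidate's score."""
    scores = {enc: score(data, enc) for enc in candidates}
    best = max(scores, key=scores.get)
    return best, scores


# "naive cafe" with accents, encoded as UTF-8.
sample = "na\u00efve caf\u00e9".encode("utf-8")
best, scores = guess(sample)
print(best, scores)
```

Both decodings succeed, so validity alone cannot choose between them; the 
UTF-8 reading wins only because it contains fewer stray symbols. The 
result is a ranked guess, not a proof.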


> Shortly  -- and consequence  --  they do not detect a coding scheme they
> only detect "a" possible coding scheme.

That at least is correct.
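
A short stdlib-only sketch of why that part is correct: the same bytes 
can be valid under several encodings at once. A failed decode rules an 
encoding out, but a successful decode confirms nothing. The helper name 
valid_decodings() and the candidate list are assumptions chosen for the 
example.

```python
def valid_decodings(data: bytes, encodings):
    """Map each encoding that decodes `data` without error to its result."""
    results = {}
    for enc in encodings:
        try:
            results[enc] = data.decode(enc)
        except UnicodeDecodeError:
            # Definitely not this encoding; nothing more can be concluded.
            pass
    return results


data = b"\xe9t\xe9"
candidates = ["latin-1", "cp1252", "iso8859-15", "cp437", "utf-8"]
results = valid_decodings(data, candidates)

for enc, text in results.items():
    print(enc, ascii(text))
```

UTF-8 is eliminated because 0xE9 must be followed by continuation bytes, 
but the four single-byte encodings all accept the data, and cp437 yields 
different text from the other three. Without outside knowledge, any of 
them is "a" possible encoding.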


> The Flexible String Representation has conceptually to face the same
> problem. 

No, it doesn't.
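
A rough way to see the difference, assuming CPython 3.3 or later (where 
PEP 393 applies): the interpreter already knows every code point in a 
str, so it picks the storage width from the largest one. Nothing is 
guessed. The helper bytes_per_char() is a name invented for this sketch, 
and sys.getsizeof() reports implementation details that other Python 
implementations need not share.

```python
import sys


def bytes_per_char(ch: str, n: int = 1000) -> int:
    """Estimate per-character storage by growing a string by one character.

    Subtracting the two sizes cancels the fixed object header.
    """
    return sys.getsizeof(ch * (n + 1)) - sys.getsizeof(ch * n)


widths = {
    "ascii": bytes_per_char("a"),            # U+0061
    "latin1": bytes_per_char("\u00e9"),      # U+00E9
    "bmp": bytes_per_char("\u20ac"),         # U+20AC
    "astral": bytes_per_char("\U0001F600"),  # U+1F600
}
print(widths)
```

On CPython 3.3+ this should report 1, 1, 2 and 4 bytes respectively. The 
width follows deterministically from the string's contents, which is 
quite unlike estimating an encoding from undecoded bytes.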


-- 
Steven


