Chardet, file, ... and the Flexible String Representation

wxjmfauth at gmail.com
Fri Sep 6 05:11:56 EDT 2013


Short comment about the "detection" tools from a previous
discussion.

The tools that are supposed to detect the coding scheme all
work with a simple logical rule, contraposition:

p  ==> q    <==>   not q  ==> not p .

In short, and as a consequence, they do not detect *the*
coding scheme; they only detect *a* possible coding scheme.
A decoding failure proves "not this encoding"; a clean decode
only proves "possibly this encoding".
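
A minimal sketch of that asymmetry, in plain Python (stdlib
only; the sample bytes and the candidate encodings are
arbitrary choices for illustration):

    # One byte string, several encodings: a decode failure rules an
    # encoding out, a decode success only means "possible".
    data = "café".encode("latin-1")          # b'caf\xe9'

    for enc in ("latin-1", "cp1252", "mac_roman", "utf-8"):
        try:
            text = data.decode(enc)
            print(f"{enc:10} -> decodes fine as {text!r}")
        except UnicodeDecodeError:
            print(f"{enc:10} -> fails, so the data is definitely not {enc}")

    # Note that mac_roman "succeeds" but yields a different character,
    # which is exactly why a clean decode is not a detection.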


The Flexible String Representation conceptually has to face
the same problem. It splits "unicode" into chunks and has to
solve two problems at the same time: the coding and the
handling of multiple "char sets". The problem? It fails.
"This poor Flexible String Representation does not manage to
solve the problem it creates itself."

Workaround: add more flags (see PEP 393).
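
The flag in question is the per-string storage width that
CPython 3.3+ selects from the highest code point in the
string (1, 2 or 4 bytes per character). A small sketch of
that rule; the exact getsizeof figures are a CPython
implementation detail and will vary between versions and
platforms:

    import sys

    # PEP 393: the internal layout of each str is chosen from its
    # largest code point (latin-1, UCS-2 or UCS-4 units).
    for s in ("abcd", "abc\u00e9", "abc\u20ac", "abc\U0001f600"):
        top = max(map(ord, s))
        kind = 1 if top <= 0xFF else (2 if top <= 0xFFFF else 4)
        print(f"{s!r:22} max U+{top:06X}  {kind} byte(s)/char  "
              f"sys.getsizeof = {sys.getsizeof(s)}")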

Still thinking "mathematics" (taking the limit). For a given
repertoire of characters one can assume that every char ends up
with its own flag (because of the usage of multiple coding
schemes). Conceptually, one quickly realizes that in the end
there will be as many flags as there are characters, and the
only valid solution is to work with a unique set of encoded
code points, where every element of this set *is* its own flag.
Curiously, that is what the utf-* encodings (and, by the way,
other coding schemes in the byte string world) are doing, with
plenty of other advantages.
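
Concretely, with utf-8 each code point carries its own width
in its leading byte, so mixed repertoires coexist in one byte
stream without any external flag (the sample characters are
arbitrary):

    # The leading byte of every UTF-8 sequence announces how many
    # bytes follow: the "flag" is part of the encoded code point itself.
    for ch in ("a", "\u00e9", "\u20ac", "\U0001f600"):
        b = ch.encode("utf-8")
        print(f"U+{ord(ch):06X} {ch!r:6} -> {len(b)} byte(s): {b.hex(' ')}")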

As already said: a healthy coding scheme can only work with
a unique set of encoded code points. That's why we have to
live today with all these coding schemes.

jmf



