Changing filenames from Greeklish => Greek (subprocess complain)

jmfauth wxjmfauth at gmail.com
Mon Jun 10 15:48:08 EDT 2013


-----

A coding scheme works with three sets. A *unique* set
of CHARACTERS, a *unique* set of CODE POINTS and a *unique*
set of ENCODED CODE POINTS, unicode or not.

The relation between the set of characters and the set of the
code points is a *human* table, created with a sheet of paper
and a pencil, a deliberate choice of characters with integers
as "labels".

The relation between the set of the code points and the
set of encoded code points is a "mathematical" operation.

In the case of an "8bits" coding scheme, like iso-XXX,
this operation is a no-op, the relation is an identity.
Shortly: set of code points == set of encoded code points.

In the case of unicode, The Unicode consortium endorses
three such mathematical operations called UTF-8, UTF-16 and
UTF-32 where UTF means Unicode Transformation Format, a
confusing wording meaning at the same time, the process
and the result of the process. This Unicode Transformation does
not produce bytes, it produces words/chunks/tokens of *bits* with
lengths 8, 16, 32, called Unicode Transformation Units (from this
the names UTF-8, -16, -32). At this level, only a structure has
been defined (there is no computing). Very important, an healthy
coding scheme works conceptually only with this *unique" set
of encoded code points, not with bytes, characters or code points.

The last step, the machine implementation: it is up to the
processor, the compiler, the language to implement all these
Unicode Transformation Units with of course their related
specifities: char, w_char, int, long, endianess, rune (Go
language), ...

Not too over-simplified or not too over-complicated and enough
to understand one, if not THE, design mistake of the flexible
string representation.

jmf




More information about the Python-list mailing list