More charset troubles (Re: Codecs for ISO 8859-11 (Thai) and 8859-16 (Romanian))

"Martin v. Löwis" martin at v.loewis.de
Tue Aug 3 08:02:52 EDT 2004


Peter Jacobi wrote:
> Looking around:
> - the RFC references a fixed year old version
> - Unicode mapping files and libiconv track the newest version
> - IBM ICU4C provides all versions
> - Python (not by planning, I assume) has a "middle" version with
> some features of the old mapping table (no currency signs) and some
> features of the new (0xA1=0x2018, 0xA2=0x2019)

Indeed. Adding new codecs is not a matter of just compiling a few files
that somebody else has produced, but requires a lot of expertise.
Therefore, I would have preferred that Python not include any codecs
of its own, and instead rely on the codecs that come with the platform
(e.g. iconv on Unix, IE DLLs on Windows).

Now, things came out differently, and we are now in charge of
maintaining what we got. This requires great care, and expert volunteers
are always welcome. Unfortunately, in the Unicode/character sets/l10n
world, there is no one true way, so experts need to stand up and voice
their opinion, hoping that contributors become at least aware of the
issues.

In the specific case of ISO-8859-7, I was until just now unaware of the
issue - I would not have guessed that ISO dared to ever change a part
of 8859. If the Python codec is ever going to be changed, I would
suggest the following approach:
- Provide two encodings: ISO-8859-7:1987 and ISO-8859-7:2003. Without
   checking, I would hope that the version in RFC 1345 is identical to
   8859-7:1987.
- Make ISO-8859-7 an alias for ISO-8859-7:1987.
Of course, somebody should really talk to IANA and come up with a
preferred MIME name. Apparently, ISO-8859-7-EURO and ISO-8859-7-2003
have been proposed.

Regards,
Martin


