More charset troubles (Re: Codecs for ISO 8859-11 (Thai) and 8859-16 (Romanian))
"Martin v. Löwis"
martin at v.loewis.de
Tue Aug 3 08:02:52 EDT 2004
Peter Jacobi wrote:
> Looking around:
> - the RFC references a fixed year old version
> - Unicode mapping files and libiconv track the newest version
> - IBM ICU4C provides all versions
> - Python (not by planning, I assume) has a "middle" version with
> some features of the old mapping table (no currency signs) and some
> features of the new (0xA1=0x2018, 0xA2=0x2019)
Indeed. Adding new codecs is not a matter of just compiling a few files
that somebody else has produced, but requires a lot of expertise.
Therefore, I would have preferred that Python not include any codecs of
its own and instead rely on the codecs that come with the platform (e.g.
iconv on Unix, IE DLLs on Windows).
Now, things came out differently, and we are now in charge of
maintaining what we have. This requires great care, and expert volunteers
are always welcome. Unfortunately, in the Unicode/character sets/l10n
world, there is no one true way, so experts need to stand up and voice
their opinions, hoping that contributors become at least aware of the
issues.
In the specific case of ISO-8859-7, I was until just now unaware of the
issue - I would not have guessed that ISO would ever dare to change a
part of 8859. If Python's codec is ever going to be changed, I would
suggest the following approach:
- provide two encodings: ISO-8859-7:1987, and ISO-8859-7:2003. Without
checking, I would hope that the version in RFC 1345 is identical with
8859-7:1987
- Make ISO-8859-7 an alias for ISO-8859-7:1987
Of course, somebody should really talk to IANA and come up with a
preferred MIME name. Apparently, ISO-8859-7-EURO and ISO-8859-7-2003
have been proposed.
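
As a rough illustration of the two-encoding approach, here is a minimal
sketch in present-day Python. It is not an existing Python codec: the
names x-iso-8859-7-1987, x-iso-8859-7-2003 and x-iso-8859-7 are
hypothetical, chosen so that they do not collide with the bundled
iso8859_7 codec. The sketch models only the three characters added in
the 2003 edition (0xA4 euro sign, 0xA5 drachma sign, 0xAA ypogegrammeni)
and assumes every other byte can be taken from the bundled table; other
differences between published mapping tables, such as 0xA1/0xA2, are not
handled.

```python
import codecs

# U+FFFE is the marker the charmap helpers treat as "byte has no mapping".
UNDEFINED = "\ufffe"


def _base_table():
    """Return a 256-entry list taken from the bundled iso8859_7 codec."""
    chars = []
    for byte in range(256):
        try:
            chars.append(bytes([byte]).decode("iso8859_7"))
        except UnicodeDecodeError:
            chars.append(UNDEFINED)
    return chars


def _make_codec(name, overrides):
    """Build a charmap codec from the base table plus explicit overrides."""
    chars = _base_table()
    for byte, char in overrides.items():
        chars[byte] = char
    decoding_table = "".join(chars)
    encoding_table = codecs.charmap_build(decoding_table)

    def encode(text, errors="strict"):
        return codecs.charmap_encode(text, errors, encoding_table)

    def decode(data, errors="strict"):
        return codecs.charmap_decode(data, errors, decoding_table)

    return codecs.CodecInfo(encode=encode, decode=decode, name=name)


# Hypothetical codec names; keys are stored in underscore-normalized form.
_CODECS = {
    "x_iso_8859_7_1987": _make_codec(
        "x-iso-8859-7-1987",
        {0xA4: UNDEFINED, 0xA5: UNDEFINED, 0xAA: UNDEFINED},
    ),
    "x_iso_8859_7_2003": _make_codec(
        "x-iso-8859-7-2003",
        {0xA4: "\u20ac", 0xA5: "\u20af", 0xAA: "\u037a"},
    ),
}
# The unversioned name is an alias for the 1987 edition, as suggested.
_CODECS["x_iso_8859_7"] = _CODECS["x_iso_8859_7_1987"]


def _search(name):
    """Codec search function: normalize separators, then look up the name."""
    key = name.lower().replace("-", "_").replace(":", "_").replace(" ", "_")
    return _CODECS.get(key)


codecs.register(_search)
```

With this registered, b"\xa4".decode("x-iso-8859-7-2003") gives the euro
sign, while the same call with "x-iso-8859-7-1987" or the unversioned
alias raises UnicodeDecodeError. Which edition the plain name should
point to is a policy decision; the alias line is the only place that
would need to change.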
Regards,
Martin