[Python-ideas] Python 3000 TIOBE -3%

Fri Feb 17 04:42:25 CET 2012

Terry Reedy writes:

 > Before Unicode, mixed encodings were the only way to have multi-lingual
 > digital text (with multiple symbol systems) in one file.

There is a long-accepted standard for doing this, ISO 2022.  IIRC it's
available online from ISO now, and if not, ECMA-35 is the same.  The X
Compound Text standard (I think this is documented in the ICCCM) and
the Motif Compound String are profiles of ISO 2022.

If that is what Paul is seeing, then the iso-2022-jp codec might be
good enough to decode the files he has, depending on which version of
ISO-2022-JP is implemented.  If not, iconv -f ISO-2022-JP-2 (or
ISO-2022-JP-3) should work (at least for GNU's iconv implementation).
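
A rough sketch of that fallback in Python, not a tested recipe for Paul's
files: CPython registers several ISO-2022-JP variants under the codec names
below, and one could try them from the narrowest profile to the broadest.
The helper name decode_iso2022, the candidate order, and the sample strings
are assumptions made for illustration.

```python
# Codec names as registered in CPython's standard library.
CANDIDATES = ("iso2022_jp", "iso2022_jp_1", "iso2022_jp_2", "iso2022_jp_3")


def decode_iso2022(data):
    """Return (text, codec_name) for the first candidate that decodes data."""
    for name in CANDIDATES:
        try:
            return data.decode(name), name
        except UnicodeDecodeError:
            continue  # this variant rejects an escape sequence or byte
    raise ValueError("no ISO-2022-JP variant tried could decode the data")


# Plain Japanese text fits the base profile.
japanese = "日本語 and English".encode("iso2022_jp")
# Korean needs KS X 1001, which only the multilingual -2 variant designates.
korean = "한국어".encode("iso2022_jp_2")

text_ja, codec_ja = decode_iso2022(japanese)
text_ko, codec_ko = decode_iso2022(korean)
print(codec_ja, text_ja)
print(codec_ko, text_ko)
```

The same narrow-to-broad order applies on the command line: try
iconv -f ISO-2022-JP first, then the -2 or -3 variants.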

 > I presume such texts used some sort of language markup like
 > <English>, <Hindi> (or <Sanskrit>), and <Tibetan>, along with
 > software that understood the markup.

They would use encoding "markup" (specifically escape sequences).
Language is not enough, as all languages have had multiple encodings
since the invention of ASCII (or EBCDIC, whichever came second ;-),
and in many cases multilingual standards have evolved (Japanese, for
example, includes Greek and Cyrillic alphabets in its JIS standard
coded character set).  More recently, many languages have several ISO
2022-based encodings (the ISO 8859 family is a conformant profile of
ISO 2022, as are the EUC encodings for Asian languages; the Windows
125x code pages are non-conformant extensions of ASCII based on ISO
8859).
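
To make the escape-sequence "markup" concrete, here is a small illustration
using Python's iso2022_jp codec.  The sample strings are arbitrary and the
variable names are mine; it only shows where the switches appear in the
byte stream, and that Greek and Cyrillic letters live in the same JIS coded
character set.

```python
ESC = b"\x1b"

mixed = "abc 日本語 xyz"
encoded = mixed.encode("iso2022_jp")

# ESC $ B designates JIS X 0208; ESC ( B switches back to ASCII.
to_jis = ESC + b"$B"
to_ascii = ESC + b"(B"
has_switches = to_jis in encoded and to_ascii in encoded

# Greek and Cyrillic letters are part of JIS X 0208, so they round-trip
# through the Japanese encoding without any additional character set.
greek_cyrillic = "αβγ Привет"
round_trip = greek_cyrillic.encode("iso2022_jp").decode("iso2022_jp")

print(encoded)
print(has_switches, round_trip == greek_cyrillic)
```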

 > Crazy text that switches among unknown encodings without notice is a 
 > possibly unsolvable decryption problem.

True, and occasionally seen even today in Japan (cat(1) will produce
such files easily, as will any system for including files).
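
A minimal sketch of the cat(1) case, assuming two small files that hold the
same text in EUC-JP and in Shift_JIS (the encodings, sample text, and the
helper try_decode are chosen only for illustration).  Concatenating the
bytes gives a stream that no single codec decodes back to the intended
text, even though each part is fine on its own.

```python
EXPECTED = "日本語\n日本語\n"

part_a = "日本語\n".encode("euc_jp")     # contents of the first file
part_b = "日本語\n".encode("shift_jis")  # contents of the second file
combined = part_a + part_b               # what cat(1) would write out


def try_decode(data, codec):
    """Decode strictly; return None if the codec rejects the bytes."""
    try:
        return data.decode(codec)
    except UnicodeDecodeError:
        return None


# Each codec either fails outright or yields text that is not EXPECTED.
results = {c: try_decode(combined, c) for c in ("euc_jp", "shift_jis", "utf_8")}
recovered = [c for c, text in results.items() if text == EXPECTED]

# Decoding piecewise works, but only if the boundary and encodings are known.
piecewise = part_a.decode("euc_jp") + part_b.decode("shift_jis")
print(recovered, piecewise == EXPECTED)
```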