Extended ASCII and code pages [was Re: for / while else doesn't make sense]

Steven D'Aprano steve at pearwood.info
Thu May 26 23:35:11 EDT 2016


On Fri, 27 May 2016 07:12 am, Marko Rauhamaa wrote:

> However, I must correct myself slightly: ASCII refers to any
> byte-oriented character encoding scheme *largely coinciding with ASCII
> proper*. But since all of them *are* derivatives of ASCII proper,
> mentioning is somewhat redundant.

"All" of them?


Here is a small selection of codecs provided by Python:

py> codecs = "cp037 cp273 cp500 cp875 cp1026 cp1140 utf_16be".split()
py> for cd in codecs:
...     print("ab.12".encode(cd))  # ASCII gives b'ab.12'
...
b'\x81\x82K\xf1\xf2'
b'\x81\x82K\xf1\xf2'
b'\x81\x82K\xf1\xf2'
b'\x81\x82K\xf1\xf2'
b'\x81\x82K\xf1\xf2'
b'\x81\x82K\xf1\xf2'
b'\x00a\x00b\x00.\x001\x002'
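
The mismatch also shows from the other direction. Here is a minimal sketch (the variable names are my own, purely for illustration) that takes the cp037 bytes printed above and tries to read them back, first with the codec that produced them and then as ASCII:

```python
# The cp037 (EBCDIC) bytes for "ab.12", copied from the session above.
ebcdic_bytes = b'\x81\x82K\xf1\xf2'

# Decoding with the codec that produced the bytes recovers the text.
roundtrip = ebcdic_bytes.decode("cp037")
print(roundtrip)

# Decoding as ASCII fails: 0x81 lies outside the 7-bit ASCII range.
try:
    ebcdic_bytes.decode("ascii")
    ascii_ok = True
except UnicodeDecodeError as err:
    ascii_ok = False
    print("not ASCII:", err.reason)
```

None of the bytes for the letters or digits matches its ASCII value, so these code pages cannot sensibly be called derivatives of ASCII.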


There's also at least one other double-byte character set, KS X 1001, used
in Korea, which as far as I can tell Python supports only through encodings
built on it, such as euc_kr.

Then there are the variable-width encodings which are backwards compatible
with ASCII only in the sense that text containing *only* ASCII characters
uses the same sequence of bytes as ASCII would. But being variable-width,
they cannot be treated as a simple array of bytes with a fixed 1 byte = 1
character mapping. Examples include UTF-8, UTF-7, the various Shift-JIS
encodings, EUC-JP, EUC-KR, EUC-TW, GB18030, Big5, and others.
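
A rough sketch of what that means in practice, assuming only the standard Python codec names utf_8, shift_jis and euc_jp (the sample strings and variable names are my own choice for illustration): ASCII-only text encodes to the same bytes as ASCII, but as soon as a non-ASCII character appears, the byte count no longer equals the character count.

```python
ascii_text = "ab.12"
mixed_text = "a\u3042"  # "a" followed by HIRAGANA LETTER A: two characters

results = {}
for codec in ("utf_8", "shift_jis", "euc_jp"):
    # ASCII-only text: same byte sequence as plain ASCII.
    same_as_ascii = ascii_text.encode(codec) == ascii_text.encode("ascii")
    # Mixed text: more bytes than characters.
    n_bytes = len(mixed_text.encode(codec))
    results[codec] = (same_as_ascii, len(mixed_text), n_bytes)
    print(codec, same_as_ascii, len(mixed_text), n_bytes)
```

Indexing into the encoded bytes by character position would therefore land on the wrong byte, or in the middle of a character.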

This concept of ASCII = "all character sets", or "nearly all", or "okay,
maybe not nearly all of them, but just the important ones" is terribly
Euro-centric. The very idea would be laughable in Japan and other East
Asian countries, where Shift-JIS and Big5 still dominate.

So please, open your mind to the reality of computing outside of Europe.
ASCII-based encodings no more encompass all of the world's natural
languages (not even the "important" ones) than "everyone is using Internet
Explorer and Windows XP, right?" describes the state of the Internet.




-- 
Steven



